Thera Bank recently saw a steep decline in the number of users of their credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to losses for the bank, so the bank wants to analyze its customer data to identify the customers who will leave the service and the reasons why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings('ignore')
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
# Libraries to split data, impute missing values
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder as ohe # this will allow us to code categorical data in binary format
# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
# Libtune to tune model, get different metric scores
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from imblearn.under_sampling import RandomUnderSampler
#Loading dataset
dataframe=pd.read_csv("BankChurners.csv")
dataframe.head()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
dataframe.shape
(10127, 21)
dataframe.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.0 | 7.391776e+08 | 3.690378e+07 | 708082083.0 | 7.130368e+08 | 7.179264e+08 | 7.731435e+08 | 8.283431e+08 |
| Customer_Age | 10127.0 | 4.632596e+01 | 8.016814e+00 | 26.0 | 4.100000e+01 | 4.600000e+01 | 5.200000e+01 | 7.300000e+01 |
| Dependent_count | 10127.0 | 2.346203e+00 | 1.298908e+00 | 0.0 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 |
| Months_on_book | 10127.0 | 3.592841e+01 | 7.986416e+00 | 13.0 | 3.100000e+01 | 3.600000e+01 | 4.000000e+01 | 5.600000e+01 |
| Total_Relationship_Count | 10127.0 | 3.812580e+00 | 1.554408e+00 | 1.0 | 3.000000e+00 | 4.000000e+00 | 5.000000e+00 | 6.000000e+00 |
| Months_Inactive_12_mon | 10127.0 | 2.341167e+00 | 1.010622e+00 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Contacts_Count_12_mon | 10127.0 | 2.455317e+00 | 1.106225e+00 | 0.0 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 6.000000e+00 |
| Credit_Limit | 10127.0 | 8.631954e+03 | 9.088777e+03 | 1438.3 | 2.555000e+03 | 4.549000e+03 | 1.106750e+04 | 3.451600e+04 |
| Total_Revolving_Bal | 10127.0 | 1.162814e+03 | 8.149873e+02 | 0.0 | 3.590000e+02 | 1.276000e+03 | 1.784000e+03 | 2.517000e+03 |
| Avg_Open_To_Buy | 10127.0 | 7.469140e+03 | 9.090685e+03 | 3.0 | 1.324500e+03 | 3.474000e+03 | 9.859000e+03 | 3.451600e+04 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 7.599407e-01 | 2.192068e-01 | 0.0 | 6.310000e-01 | 7.360000e-01 | 8.590000e-01 | 3.397000e+00 |
| Total_Trans_Amt | 10127.0 | 4.404086e+03 | 3.397129e+03 | 510.0 | 2.155500e+03 | 3.899000e+03 | 4.741000e+03 | 1.848400e+04 |
| Total_Trans_Ct | 10127.0 | 6.485869e+01 | 2.347257e+01 | 10.0 | 4.500000e+01 | 6.700000e+01 | 8.100000e+01 | 1.390000e+02 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 7.122224e-01 | 2.380861e-01 | 0.0 | 5.820000e-01 | 7.020000e-01 | 8.180000e-01 | 3.714000e+00 |
| Avg_Utilization_Ratio | 10127.0 | 2.748936e-01 | 2.756915e-01 | 0.0 | 2.300000e-02 | 1.760000e-01 | 5.030000e-01 | 9.990000e-01 |
We can see from .shape that there are 10,127 rows in the dataset. The describe output shows that all the counts equal 10,127, which means there are no null values in the dataset. However, this does not mean there are no 0s or other placeholders that represent missing values. There are a number of categorical variables, such as Gender, Education Level, and Marital Status. The dependent variable, the variable of interest, Attrition_Flag, is in binary format according to the data dictionary, but above it appears as categorical. We will have to check that there are only two values so we can convert it to binary.
CLIENTNUM appears to be a unique identifier, which would mean we could drop this column. From .describe(), we can see that many of the numerical variables have 0 as their minimum. We will have to decide whether these zeros are legitimate values or missing values that need to be replaced with different values (e.g., the median or mean).
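Before deciding, a quick way to surface candidate placeholder values is to count the zeros in each numeric column. A minimal sketch, using a small toy frame in place of the real dataset:

```python
import pandas as pd

def zero_counts(df: pd.DataFrame) -> pd.Series:
    """Count zeros in each numeric column, sorted descending."""
    num = df.select_dtypes(include="number")
    return num.eq(0).sum().sort_values(ascending=False)

# Toy frame standing in for the real dataset
sample = pd.DataFrame({
    "Total_Revolving_Bal": [777, 0, 0, 2517],
    "Avg_Utilization_Ratio": [0.061, 0.0, 0.0, 0.76],
    "Customer_Age": [45, 49, 51, 40],
})
print(zero_counts(sample))
```

Columns where zeros dominate (e.g., a revolving balance of 0 alongside a utilization ratio of 0) can then be judged individually: a zero balance is plausible, whereas a zero in a variable that should never be zero would point to a placeholder.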
dataframe.isnull().sum() # we saw above that all counts = 10127, so these should all be zero...
CLIENTNUM 0 Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
dupes = dataframe.duplicated() #This will check to see if any of the rows are duplicates
sum(dupes)
0
There do not appear to be any duplicate rows. At this point, we will not drop any rows, but we may need to if there are rows with important information missing that cannot be easily replaced. Let's look at the datatypes of each column...
dataframe.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 10127 non-null object 6 Marital_Status 10127 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
There are 21 columns and 10,127 rows. Six columns appear in .info() that were not displayed by .describe(); this is because they have the object dtype, meaning they are categorical variables. Before we make any changes to the dataset, let's create a copy called 'TB' (short for Thera Bank). This allows us to keep the original dataset in case we need to recall it for some reason, and gives us a shorthand name for easy manipulation.
TB = dataframe.copy() # This creates a copy of the dataframe
The first step is to remove the columns that do not provide utility. One such column is CLIENTNUM. CLIENTNUM is a unique key, but considering that it is simply a numbering of the clients, we can remove this column and rely on the index to serve the same purpose. We could see from .describe() that the numbers are not significant and appear randomly assigned; this was confirmed in the data dictionary. By dropping this column, Python will not perform needless analysis on it (see .describe() above, for example).
TB.drop('CLIENTNUM', axis=1, inplace=True) # drop the client ID column; the index serves as the row identifier
Our next step is to rename some of the columns. Many of the column names are long and can be shortened for easier EDA. This is merely a preference and will not affect the model.
# Personal preference: shortening the column names for easier EDA
TB.rename(columns={'Attrition_Flag': 'Flag',
                   'Customer_Age': 'Age',
                   'Dependent_count': 'Dependents',
                   'Education_Level': 'Education',
                   'Income_Category': 'Income',
                   'Card_Category': 'Card',
                   'Months_on_book': 'Months',
                   'Total_Relationship_Count': 'Products_Held',
                   'Months_Inactive_12_mon': 'Months_Inactive',
                   'Contacts_Count_12_mon': 'Contacts',
                   'Total_Revolving_Bal': 'Balance',
                   'Avg_Open_To_Buy': 'Ave_Credit_Line',
                   'Total_Amt_Chng_Q4_Q1': 'Trans_Changes',
                   'Total_Trans_Amt': 'Trans_Totals',
                   'Total_Trans_Ct': 'Trans_Count',
                   'Total_Ct_Chng_Q4_Q1': 'Count_Changes',
                   'Avg_Utilization_Ratio': 'Ratio'}, inplace=True)
TB.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Flag 10127 non-null object 1 Age 10127 non-null int64 2 Gender 10127 non-null object 3 Dependents 10127 non-null int64 4 Education 10127 non-null object 5 Marital_Status 10127 non-null object 6 Income 10127 non-null object 7 Card 10127 non-null object 8 Months 10127 non-null int64 9 Products_Held 10127 non-null int64 10 Months_Inactive 10127 non-null int64 11 Contacts 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Balance 10127 non-null int64 14 Ave_Credit_Line 10127 non-null float64 15 Trans_Changes 10127 non-null float64 16 Trans_Totals 10127 non-null int64 17 Trans_Count 10127 non-null int64 18 Count_Changes 10127 non-null float64 19 Ratio 10127 non-null float64 dtypes: float64(5), int64(9), object(6) memory usage: 1.5+ MB
We now have column names that are shorter, easier to work with, and in some cases, easier to understand. Some of the previous names, such as 'Total_Ct_Chng_Q4_Q1', were difficult to grasp.
Attrition Flag
TB.Flag.describe()
count 10127 unique 2 top Existing Customer freq 8500 Name: Flag, dtype: object
# This will show us the 2 unique values and their frequencies
TB["Flag"].value_counts()
Existing Customer 8500 Attrited Customer 1627 Name: Flag, dtype: int64
TB['Flag'].isnull().sum()
0
Attrition Flag has only 2 distinct values and 0 missing. The two options are Existing Customer and Attrited Customer. The data dictionary describes this variable as "Internal event (customer activity) variable - if the account is closed then 1 else 0". We should, therefore, change Existing Customer to 0 and Attrited Customer to 1. We will do this in the data engineering section below.
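Since only a minority of customers are attrited, the class balance is worth quantifying early (the SMOTE and undersampling imports above suggest it will matter during modeling). A minimal sketch, using the counts from the value_counts() output above:

```python
import pandas as pd

# Counts taken from the value_counts() output above
counts = pd.Series({"Existing Customer": 8500, "Attrited Customer": 1627})
minority_share = counts.min() / counts.sum()
print(f"Minority class share: {minority_share:.1%}")  # roughly 16%
```

A minority share of about 16% is imbalanced enough that accuracy alone would be a misleading metric, which motivates the recall- and F1-oriented scoring imported earlier.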
Age
TB.Age.describe()
count 10127.000000 mean 46.325960 std 8.016814 min 26.000000 25% 41.000000 50% 46.000000 75% 52.000000 max 73.000000 Name: Age, dtype: float64
TB['Age'].isnull().sum()
0
Age is a continuous variable. It ranges from 26 to 73. The mean is about 46 years. We can see that there are no missing values and no true outliers.
Gender
TB.Gender.describe()
count 10127 unique 2 top F freq 5358 Name: Gender, dtype: object
# This will show us the 2 unique values and their frequencies
TB["Gender"].value_counts()
F 5358 M 4769 Name: Gender, dtype: int64
TB['Gender'].isnull().sum()
0
Gender has only 2 distinct values and 0 missing. The two options are F and M. There appear to be more female customers in the dataset than male.
Dependents
TB.Dependents.describe()
count 10127.000000 mean 2.346203 std 1.298908 min 0.000000 25% 1.000000 50% 2.000000 75% 3.000000 max 5.000000 Name: Dependents, dtype: float64
TB['Dependents'].isnull().sum()
0
Dependents is a discrete, numerical variable. It ranges from 0 to 5. The mean is about 2 dependents. We can see that there are no missing values and no true outliers.
Education
TB.Education.describe()
count 10127 unique 7 top Graduate freq 3128 Name: Education, dtype: object
# This will show us the 7 unique values and their frequencies
TB["Education"].value_counts()
Graduate 3128 High School 2013 Unknown 1519 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education, dtype: int64
TB['Education'].isnull().sum()
0
Education has only 7 distinct values. There are 0 null values; however, over 1,500 are 'Unknown'. It may be necessary to replace these values later. We will have to conduct more EDA first to decide whether that is the best option.
Marital Status
TB.Marital_Status.describe()
count 10127 unique 4 top Married freq 4687 Name: Marital_Status, dtype: object
# This will show us the 4 unique values and their frequencies
TB["Marital_Status"].value_counts()
Married 4687 Single 3943 Unknown 749 Divorced 748 Name: Marital_Status, dtype: int64
TB['Marital_Status'].isnull().sum()
0
Marital Status has 4 distinct values. There are 0 null values; however, over 700 are 'Unknown'. It may be necessary to replace these values later. We will have to conduct more EDA first to decide whether that is the best option.
Income
TB.Income.describe()
count 10127 unique 6 top Less than $40K freq 3561 Name: Income, dtype: object
# This will show us the 6 unique values and their frequencies
TB["Income"].value_counts()
Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 Unknown 1112 $120K + 727 Name: Income, dtype: int64
TB['Income'].isnull().sum()
0
Income has 6 distinct values. There are 0 null values, but once again over 1,100 entries are 'Unknown'. We can also see that this category consists of ranges of values, so there is an order, or hierarchy, that should be preserved for Python (which would otherwise not be able to identify that $120K+ is more than $40K - $60K). We will fix this column in the data engineering section.
Card
TB.Card.describe()
count 10127 unique 4 top Blue freq 9436 Name: Card, dtype: object
# This will show us the 4 unique values and their frequencies
TB["Card"].value_counts()
Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card, dtype: int64
TB['Card'].isnull().sum()
0
Card has 4 distinct values and 0 missing. We can probably infer that there is a hierarchy, or order, to the cards, with Blue being the lowest and Platinum the highest. We will order this column in the data engineering section below.
Months
TB.Months.describe()
count 10127.000000 mean 35.928409 std 7.986416 min 13.000000 25% 31.000000 50% 36.000000 75% 40.000000 max 56.000000 Name: Months, dtype: float64
TB['Months'].isnull().sum()
0
Months is a discrete, numerical variable. It ranges from 13 to 56. The mean is about 36 months. We can see that there are no missing values and no true outliers.
Products_Held
TB.Products_Held.describe()
count 10127.000000 mean 3.812580 std 1.554408 min 1.000000 25% 3.000000 50% 4.000000 75% 5.000000 max 6.000000 Name: Products_Held, dtype: float64
TB['Products_Held'].isnull().sum()
0
Products_Held is a discrete, numerical variable. It ranges from 1 to 6. The mean is about 4 products. We can see that there are no missing values and no true outliers.
Months_Inactive
TB.Months_Inactive.describe()
count 10127.000000 mean 2.341167 std 1.010622 min 0.000000 25% 2.000000 50% 2.000000 75% 3.000000 max 6.000000 Name: Months_Inactive, dtype: float64
TB['Months_Inactive'].isnull().sum()
0
Months_Inactive is a discrete, numerical variable. It ranges from 0 to 6. The mean is about 2 months. We can see that there are no missing values and no true outliers.
Contacts
TB.Contacts.describe()
count 10127.000000 mean 2.455317 std 1.106225 min 0.000000 25% 2.000000 50% 2.000000 75% 3.000000 max 6.000000 Name: Contacts, dtype: float64
TB['Contacts'].isnull().sum()
0
Contacts is a discrete, numerical variable. It ranges from 0 to 6. The mean is about 2 contacts. We can see that there are no missing values and no true outliers.
Credit_Limit
TB.Credit_Limit.describe()
count 10127.000000 mean 8631.953698 std 9088.776650 min 1438.300000 25% 2555.000000 50% 4549.000000 75% 11067.500000 max 34516.000000 Name: Credit_Limit, dtype: float64
TB['Credit_Limit'].isnull().sum()
0
Credit_Limit is a continuous variable. It ranges from 1,438 to 34,516. The mean is about 8,632 dollars. We can see that there are no missing values. The max of almost 35k may be an outlier; we will have to check during univariate analysis.
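One common way to check for outliers ahead of the univariate plots is the 1.5×IQR rule. A minimal sketch, with a toy series standing in for the real column:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

# Toy series standing in for TB['Credit_Limit']
limits = pd.Series([2555, 4549, 11067, 34516, 3000, 5000, 4000, 2600])
print(iqr_outliers(limits))  # only the extreme value is flagged
```

The same helper can be applied to any of the numeric columns flagged above (Ave_Credit_Line, Trans_Totals, Trans_Count) to see how many points fall beyond the whiskers of the boxplots drawn later.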
Balance
TB.Balance.describe()
count 10127.000000 mean 1162.814061 std 814.987335 min 0.000000 25% 359.000000 50% 1276.000000 75% 1784.000000 max 2517.000000 Name: Balance, dtype: float64
TB['Balance'].isnull().sum()
0
Balance is a continuous variable. It ranges from 0 to 2517. The mean is about 1163 dollars. We can see that there are no missing values and no true outliers.
Average Open to Buy Credit Line
TB.Ave_Credit_Line.describe()
count 10127.000000 mean 7469.139637 std 9090.685324 min 3.000000 25% 1324.500000 50% 3474.000000 75% 9859.000000 max 34516.000000 Name: Ave_Credit_Line, dtype: float64
TB['Ave_Credit_Line'].isnull().sum()
0
Average Credit Line is a continuous variable. It ranges from 3 to 34,516. The mean is about 7,469 dollars. We can see that there are no missing values. The max of almost 35k may be an outlier; we will have to check during univariate analysis.
Trans_Changes
TB.Trans_Changes.describe()
count 10127.000000 mean 0.759941 std 0.219207 min 0.000000 25% 0.631000 50% 0.736000 75% 0.859000 max 3.397000 Name: Trans_Changes, dtype: float64
TB['Trans_Changes'].isnull().sum()
0
Trans_Changes is a continuous variable. It ranges from 0 to about 3.4. The mean is about 0.76. We can see that there are no missing values and no true outliers.
Trans_Totals
TB.Trans_Totals.describe()
count 10127.000000 mean 4404.086304 std 3397.129254 min 510.000000 25% 2155.500000 50% 3899.000000 75% 4741.000000 max 18484.000000 Name: Trans_Totals, dtype: float64
TB['Trans_Totals'].isnull().sum()
0
Trans_Totals is a continuous variable. It ranges from 510 to 18,484. The mean is about 4404 dollars. We can see that there are no missing values. The max value may be an outlier, we will have to examine it more closely using univariate analysis.
Trans_Count
TB.Trans_Count.describe()
count 10127.000000 mean 64.858695 std 23.472570 min 10.000000 25% 45.000000 50% 67.000000 75% 81.000000 max 139.000000 Name: Trans_Count, dtype: float64
TB['Trans_Count'].isnull().sum()
0
Trans_Count is a continuous variable. It ranges from 10 to 139. The mean is about 65 transactions. We can see that there are no missing values. The max value may be an outlier, we will have to examine it more closely using univariate analysis.
Count_Changes
TB.Count_Changes.describe()
count 10127.000000 mean 0.712222 std 0.238086 min 0.000000 25% 0.582000 50% 0.702000 75% 0.818000 max 3.714000 Name: Count_Changes, dtype: float64
TB['Count_Changes'].isnull().sum()
0
Count_Changes is a continuous variable. It ranges from 0 to about 3.7. The mean is about 0.71. We can see that there are no missing values and no true outliers.
Ratio
TB.Ratio.describe()
count 10127.000000 mean 0.274894 std 0.275691 min 0.000000 25% 0.023000 50% 0.176000 75% 0.503000 max 0.999000 Name: Ratio, dtype: float64
TB['Ratio'].isnull().sum()
0
Ratio is a continuous variable. It ranges from 0 to 0.999. The mean is about 0.27. We can see that there are no missing values. The max value may be an outlier; we will have to examine it more closely using univariate analysis. We can also see that the values of this ratio all lie between 0 and 1.
Let's create a checkpoint here so that we can come back to this dataset later on if we need to...
TB3 = TB.copy() # This creates a copy of the dataframe
The first step we will take during the data preprocessing stage is to convert some of the values to types Python can more readily use. For example, the categorical values of Attrition Flag can be converted to binary. Some of the other categorical columns that we identified as having a hierarchical order should be converted as well.
Attrition Flag
TB["Flag"].value_counts()
Existing Customer 8500 Attrited Customer 1627 Name: Flag, dtype: int64
TB['Flag'] = TB['Flag'].map({'Existing Customer': 0, 'Attrited Customer': 1,
})
#Changes the categorical flag values to binary format
The Attrition Flag values were changed to binary format (0, 1) in order to match the data dictionary provided by the bank. It is also much easier to run logistic regression in Python when the dependent variable is in binary format.
Gender
TB["Gender"].value_counts()
F 5358 M 4769 Name: Gender, dtype: int64
TB['Gender'] = TB['Gender'].map({'M': 0, 'F': 1,
})
#Changes the categorical Gender values to binary format
The Gender values are not ranked; however, they can be changed to binary so that we can look for correlations during multivariate EDA. We are now essentially asking the question: "Is the customer female?" 0 = no, 1 = yes.
Education
TB["Education"].value_counts()
Graduate 3128 High School 2013 Unknown 1519 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education, dtype: int64
TB['Education'] = TB['Education'].map({'Unknown': -1, 'Uneducated': 0, 'High School': 1, 'College': 2, 'Graduate': 3,
'Post-Graduate': 4, 'Doctorate': 5,
})
#Changes the categorical education values to correctly ranked labels
The education values were clearly ordered values. Uneducated would be considered the lowest level achieved, followed by: High School, College, Graduate, Post Graduate and Doctorate. They have been ordered accordingly.
Note: for the time being, we are leaving the 'unknown' values as -1. During the univariate/multivariate EDA stages we may find compelling reasons to replace -1s with more appropriate values.
Income
TB["Income"].value_counts()
Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 Unknown 1112 $120K + 727 Name: Income, dtype: int64
TB['Income'] = TB['Income'].map({'Unknown': -1, 'Less than $40K': 0, '$40K - $60K': 1, '$60K - $80K': 2,
'$80K - $120K': 3, '$120K +': 4,
})
#Changes the categorical income values to correctly ranked labels
The income values were clearly ordered values. Customers who make less than 40k per year would be considered the lowest level, followed by: customers who make between 40-60k, 60-80k, 80-120k and finally, 120k +. They have been ordered accordingly.
Note: for the time being, we are leaving the 'unknown' values as -1. During the univariate/multivariate EDA stages we may find compelling reasons to replace -1s with more appropriate values.
Card
TB["Card"].value_counts()
Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card, dtype: int64
TB['Card'] = TB['Card'].map({'Blue': 0, 'Silver': 1, 'Gold': 2, 'Platinum': 3
})
#Changes the categorical card values to correctly ranked labels
The card values were clearly ordered values. Blue would be considered the lowest, or entry level card, followed by: Silver, Gold and then Platinum. They have been ordered accordingly.
TB.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Flag 10127 non-null int64 1 Age 10127 non-null int64 2 Gender 10127 non-null int64 3 Dependents 10127 non-null int64 4 Education 10127 non-null int64 5 Marital_Status 10127 non-null object 6 Income 10127 non-null int64 7 Card 10127 non-null int64 8 Months 10127 non-null int64 9 Products_Held 10127 non-null int64 10 Months_Inactive 10127 non-null int64 11 Contacts 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Balance 10127 non-null int64 14 Ave_Credit_Line 10127 non-null float64 15 Trans_Changes 10127 non-null float64 16 Trans_Totals 10127 non-null int64 17 Trans_Count 10127 non-null int64 18 Count_Changes 10127 non-null float64 19 Ratio 10127 non-null float64 dtypes: float64(5), int64(14), object(1) memory usage: 1.5+ MB
We can see that Marital Status is the only object column left in the dataset. This is fine because its values are not ordered. We would not be able to see correlations properly if we assigned them arbitrary numerical values, so it is best to leave them as categorical data. We will have to one-hot encode these values before we build our model.
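As a sketch of that one-hot step, pd.get_dummies works directly on a DataFrame (sklearn's OneHotEncoder inside a Pipeline would work equally well); here a toy frame stands in for TB:

```python
import pandas as pd

# Toy frame standing in for TB; only Marital_Status needs encoding
sample = pd.DataFrame({"Marital_Status": ["Married", "Single", "Unknown", "Divorced"],
                       "Age": [45, 49, 40, 52]})

# drop_first=True drops one category to avoid the dummy-variable trap
encoded = pd.get_dummies(sample, columns=["Marital_Status"], drop_first=True)
print(encoded.columns.tolist())
```

Each remaining category becomes its own 0/1 column, so no artificial ordering is imposed the way an integer mapping would.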
def histogram_boxplot(data, xlabel=None, title=None, font_scale=2, figsize=(15,7), bins=None):
    mean = np.mean(data)
    sns.set(font_scale=font_scale) # set the seaborn font scale
    f2, (ax_box2, ax_hist2) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.25, .75)}, figsize=figsize)
    sns.boxplot(data, ax=ax_box2, showmeans=True, color="violet") # boxplot; a star indicates the mean value
    sns.distplot(data, kde=False, ax=ax_hist2, bins=bins) # histogram; bins=None lets seaborn choose
    ax_hist2.axvline(mean, color='g', linestyle='--') # the mean shows as a vertical line in the histogram
    if xlabel: ax_hist2.set(xlabel=xlabel) # xlabel
    if title: ax_box2.set(title=title) # title of the graph
    plt.show() # render the figure
def perc_on_bar(plot, feature):
    '''
    plot: the axes returned by sns.countplot
    feature: the categorical column being plotted
    Note: the function won't work if a column is passed in the hue parameter
    '''
    total = len(feature) # number of observations in the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total) # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05 # x position of the label
        y = p.get_y() + p.get_height() # y position of the label
        plot.annotate(percentage, (x,y), size = 20) # annotate the percentage above each bar
    plt.show() # shows the plot
Attrition Flag
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Flag'], palette='winter')
perc_on_bar(ax,TB['Flag'])
Age
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Age'], palette='winter')
median = np.median(TB.Age) # find the median age of all customers
mean = np.mean(TB.Age) # find the mean age of all customers
print('The mean equals', mean)
print('The median equals', median)
The mean equals 46.32596030413745 The median equals 46.0
Gender
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Gender'], palette='winter')
perc_on_bar(ax,TB['Gender'])
Dependents
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Dependents'], palette='winter')
perc_on_bar(ax,TB['Dependents'])
median = np.median(TB.Dependents) # find the median number of dependents
mean = np.mean(TB.Dependents) # find the mean number of dependents
print('The mean equals', mean)
print('The median equals', median)
The mean equals 2.3462032191172115 The median equals 2.0
Education
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Education'], palette='winter')
perc_on_bar(ax,TB['Education'])
median = np.median(TB.Education) # find the median encoded education level
mean = np.mean(TB.Education) # find the mean encoded education level
print('The mean equals', mean)
print('The median equals', median)
The mean equals 1.6019551693492644 The median equals 2.0
Marital Status
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Marital_Status'], palette='winter')
perc_on_bar(ax,TB['Marital_Status'])
Income
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Income'], palette='winter')
perc_on_bar(ax,TB['Income'])
median = np.median(TB.Income) # find the median income of all customers
mean = np.mean(TB.Income) # find the mean income of all customers
print('The mean equals', mean)
print('The median equals', median)
The mean equals 1.0857114644020933 The median equals 1.0
Card
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Card'], palette='winter')
perc_on_bar(ax,TB['Card'])
median = np.median(TB.Card) # find the median encoded card tier
mean = np.mean(TB.Card) # find the mean encoded card tier
print('The mean equals', mean)
print('The median equals', median)
The mean equals 0.08363779994075245 The median equals 0.0
Months
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Months'], palette='winter')
median = np.median(TB.Months) # find the median months on book
mean = np.mean(TB.Months) # find the mean months on book
print('The mean equals', mean)
print('The median equals', median)
The mean equals 35.928409203120374 The median equals 36.0
print('Kurtosis of Months variable : {}'.format(TB['Months'].kurt()))
Kurtosis of Months variable : 0.40010012019986707
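The same shape statistics can be computed for every numeric column at once with pandas' .skew() and .kurt() (excess kurtosis). A minimal sketch, with a toy frame standing in for TB:

```python
import pandas as pd

def shape_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Skewness and excess kurtosis for every numeric column."""
    num = df.select_dtypes(include="number")
    return pd.DataFrame({"skew": num.skew(), "kurtosis": num.kurt()})

# Toy frame standing in for TB
sample = pd.DataFrame({"Months": [13, 31, 36, 40, 56],
                       "Credit_Limit": [1438.3, 2555.0, 4549.0, 11067.5, 34516.0]})
print(shape_stats(sample).round(2))
```

Strongly right-skewed columns such as Credit_Limit may benefit from a transformation or robust scaling before the linear models imported above are fit; tree-based models are largely unaffected.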
Products_Held
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Products_Held'], palette='winter')
perc_on_bar(ax,TB['Products_Held'])
median = np.median(TB.Products_Held) # find the median number of products held
mean = np.mean(TB.Products_Held) # find the mean number of products held
print('The mean equals', mean)
print('The median equals', median)
The mean equals 3.8125802310654686 The median equals 4.0
Contacts
plt.figure(figsize=(15,5))
ax = sns.countplot(TB['Contacts'], palette='winter')
perc_on_bar(ax,TB['Contacts'])
median = np.median(TB.Contacts) # find the median number of contacts
mean = np.mean(TB.Contacts) # find the mean number of contacts
print('The mean equals', mean)
print('The median equals', median)
The mean equals 2.4553174681544387 The median equals 2.0
Credit_Limit
histogram_boxplot(TB['Credit_Limit'])
median = np.median(TB.Credit_Limit) # find the median credit limit
mean = np.mean(TB.Credit_Limit) # find the mean credit limit
print('The mean equals', mean)
print('The median equals', median)
The mean equals 8631.953698034848 The median equals 4549.0
Balance
histogram_boxplot(TB['Balance'])
median = np.median(TB.Balance) # find the median balance of all customers
mean = np.mean(TB.Balance) # find the mean balance of all customers
print('The mean equals', mean)
print('The median equals', median)
The mean equals 1162.8140614199665
The median equals 1276.0
Ave_Credit_Line
histogram_boxplot(TB['Ave_Credit_Line'])
median = np.median(TB.Ave_Credit_Line) # find the median average credit line
mean = np.mean(TB.Ave_Credit_Line) # find the mean average credit line
print('The mean equals', mean)
print('The median equals', median)
The mean equals 7469.139636614887
The median equals 3474.0
Trans_Changes
histogram_boxplot(TB['Trans_Changes'])
median = np.median(TB.Trans_Changes) # find the median transaction change
mean = np.mean(TB.Trans_Changes) # find the mean transaction change
print('The mean equals', mean)
print('The median equals', median)
The mean equals 0.7599406536980376
The median equals 0.736
TB.loc[TB['Trans_Changes'] >= 3]
| | Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 0 | 37 | 0 | 3 | 0 | Single | 2 | 0 | 36 | 5 | 2 | 0 | 22352.0 | 2517 | 19835.0 | 3.355 | 1350 | 24 | 1.182 | 0.113 |
| 12 | 0 | 56 | 0 | 1 | 2 | Single | 3 | 0 | 36 | 3 | 6 | 0 | 11751.0 | 0 | 11751.0 | 3.397 | 1539 | 17 | 3.250 | 0.000 |
Trans_Totals
histogram_boxplot(TB['Trans_Totals'])
median = np.median(TB.Trans_Totals) # find the median transaction total
mean = np.mean(TB.Trans_Totals) # find the mean transaction total
print('The mean equals', mean)
print('The median equals', median)
The mean equals 4404.086303939963
The median equals 3899.0
Trans_Count
histogram_boxplot(TB['Trans_Count'])
median = np.median(TB.Trans_Count) # find the median transaction count
mean = np.mean(TB.Trans_Count) # find the mean transaction count
print('The mean equals', mean)
print('The median equals', median)
The mean equals 64.85869457884863
The median equals 67.0
TB.loc[TB['Trans_Count'] >= 135]
| | Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9324 | 0 | 41 | 0 | 3 | -1 | Married | 4 | 0 | 33 | 2 | 4 | 3 | 34516.0 | 638 | 33878.0 | 0.724 | 13085 | 139 | 0.675 | 0.018 |
| 9586 | 0 | 56 | 1 | 1 | 1 | Married | -1 | 0 | 49 | 1 | 2 | 1 | 17542.0 | 2517 | 15025.0 | 0.800 | 13939 | 138 | 0.792 | 0.143 |
Count_Changes
histogram_boxplot(TB['Count_Changes'])
median = np.median(TB.Count_Changes) # find the median count change
mean = np.mean(TB.Count_Changes) # find the mean count change
print('The mean equals', mean)
print('The median equals', median)
The mean equals 0.7122223758269962
The median equals 0.7020000000000001
TB.loc[TB['Count_Changes'] >= 3.5]
| | Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 49 | 1 | 5 | 3 | Single | 0 | 0 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 269 | 0 | 54 | 0 | 5 | 3 | Married | 2 | 0 | 38 | 3 | 3 | 3 | 2290.0 | 1434 | 856.0 | 0.923 | 1119 | 18 | 3.500 | 0.626 |
| 773 | 0 | 61 | 0 | 0 | 4 | Married | -1 | 0 | 53 | 6 | 2 | 3 | 14434.0 | 1927 | 12507.0 | 2.675 | 1731 | 32 | 3.571 | 0.134 |
Ratio
histogram_boxplot(TB['Ratio'])
median = np.median(TB.Ratio) # find the median utilization ratio
mean = np.mean(TB.Ratio) # find the mean utilization ratio
print('The mean equals', mean)
print('The median equals', median)
The mean equals 0.2748935518909845
The median equals 0.17600000000000002
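The mean/median cells above repeat the same few lines for every column. A small helper can compute the centre statistics for any list of columns in one pass; this is a hedged sketch with toy data standing in for TB, and `describe_center` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
import pandas as pd

# Sketch: compute mean, median, skew and kurtosis for several columns at once.
# On the real data this would be called as describe_center(TB, ['Months', ...]).
def describe_center(df, cols):
    rows = []
    for col in cols:
        s = df[col]
        rows.append({'column': col, 'mean': s.mean(), 'median': s.median(),
                     'skew': s.skew(), 'kurtosis': s.kurt()})
    return pd.DataFrame(rows).set_index('column')

# Toy stand-in for TB with two of the notebook's column names
demo = pd.DataFrame({'Months': [24, 36, 36, 48, 60], 'Contacts': [1, 2, 2, 3, 4]})
print(describe_center(demo, ['Months', 'Contacts']))
```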
We can start by checking the correlation between the numerical data variables by using .corr and a heatmap function.
TB.corr() # creates a table of how the numerical values are correlated
| | Flag | Age | Gender | Dependents | Education | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flag | 1.000000 | 0.018203 | 0.037272 | 0.018991 | 0.008796 | -0.013577 | 0.002354 | 0.013687 | -0.150005 | 0.152449 | 0.204491 | -0.023873 | -0.263053 | -0.000285 | -0.131063 | -0.168598 | -0.371403 | -0.290054 | -0.178410 |
| Age | 0.018203 | 1.000000 | 0.017312 | -0.122254 | -0.002369 | 0.023508 | -0.018235 | 0.788912 | -0.010931 | 0.054361 | -0.018452 | 0.002476 | 0.014780 | 0.001151 | -0.062042 | -0.046446 | -0.067097 | -0.012143 | 0.007114 |
| Gender | 0.037272 | 0.017312 | 1.000000 | -0.004563 | 0.005087 | -0.786608 | -0.080093 | 0.006728 | -0.003157 | 0.011163 | -0.039987 | -0.420806 | -0.029658 | -0.418059 | -0.026712 | -0.024890 | 0.067454 | 0.005800 | 0.257851 |
| Dependents | 0.018991 | -0.122254 | -0.004563 | 1.000000 | 0.000472 | 0.066278 | 0.030469 | -0.103062 | -0.039076 | -0.010768 | -0.040505 | 0.068065 | -0.002688 | 0.068291 | -0.035439 | 0.025046 | 0.049912 | 0.011087 | -0.037135 |
| Education | 0.008796 | -0.002369 | 0.005087 | 0.000472 | 1.000000 | -0.011677 | 0.014989 | 0.006613 | 0.000766 | 0.005761 | -0.006280 | -0.002354 | -0.006800 | -0.001743 | -0.010040 | -0.007460 | -0.004307 | -0.016692 | -0.001849 |
| Income | -0.013577 | 0.023508 | -0.786608 | 0.066278 | -0.011677 | 1.000000 | 0.077326 | 0.022122 | -0.003202 | -0.016310 | 0.023113 | 0.475972 | 0.034718 | 0.472760 | 0.011352 | 0.019651 | -0.054569 | -0.012657 | -0.246476 |
| Card | 0.002354 | -0.018235 | -0.080093 | 0.030469 | 0.014989 | 0.077326 | 1.000000 | -0.012535 | -0.094077 | -0.014629 | -0.000442 | 0.492446 | 0.026304 | 0.489985 | 0.007385 | 0.196003 | 0.134275 | -0.007261 | -0.198711 |
| Months | 0.013687 | 0.788912 | 0.006728 | -0.103062 | 0.006613 | 0.022122 | -0.012535 | 1.000000 | -0.009203 | 0.074164 | -0.010774 | 0.007507 | 0.008623 | 0.006732 | -0.048959 | -0.038591 | -0.049819 | -0.014072 | -0.007541 |
| Products_Held | -0.150005 | -0.010931 | -0.003157 | -0.039076 | 0.000766 | -0.003202 | -0.094077 | -0.009203 | 1.000000 | -0.003675 | 0.055203 | -0.071386 | 0.013726 | -0.072601 | 0.050119 | -0.347229 | -0.241891 | 0.040831 | 0.067663 |
| Months_Inactive | 0.152449 | 0.054361 | 0.011163 | -0.010768 | 0.005761 | -0.016310 | -0.014629 | 0.074164 | -0.003675 | 1.000000 | 0.029493 | -0.020394 | -0.042210 | -0.016605 | -0.032247 | -0.036982 | -0.042787 | -0.038989 | -0.007503 |
| Contacts | 0.204491 | -0.018452 | -0.039987 | -0.040505 | -0.006280 | 0.023113 | -0.000442 | -0.010774 | 0.055203 | 0.029493 | 1.000000 | 0.020817 | -0.053913 | 0.025646 | -0.024445 | -0.112774 | -0.152213 | -0.094997 | -0.055471 |
| Credit_Limit | -0.023873 | 0.002476 | -0.420806 | 0.068065 | -0.002354 | 0.475972 | 0.492446 | 0.007507 | -0.071386 | -0.020394 | 0.020817 | 1.000000 | 0.042493 | 0.995981 | 0.012813 | 0.171730 | 0.075927 | -0.002020 | -0.482965 |
| Balance | -0.263053 | 0.014780 | -0.029658 | -0.002688 | -0.006800 | 0.034718 | 0.026304 | 0.008623 | 0.013726 | -0.042210 | -0.053913 | 0.042493 | 1.000000 | -0.047167 | 0.058174 | 0.064370 | 0.056060 | 0.089861 | 0.624022 |
| Ave_Credit_Line | -0.000285 | 0.001151 | -0.418059 | 0.068291 | -0.001743 | 0.472760 | 0.489985 | 0.006732 | -0.072601 | -0.016605 | 0.025646 | 0.995981 | -0.047167 | 1.000000 | 0.007595 | 0.165923 | 0.070885 | -0.010076 | -0.538808 |
| Trans_Changes | -0.131063 | -0.062042 | -0.026712 | -0.035439 | -0.010040 | 0.011352 | 0.007385 | -0.048959 | 0.050119 | -0.032247 | -0.024445 | 0.012813 | 0.058174 | 0.007595 | 1.000000 | 0.039678 | 0.005469 | 0.384189 | 0.035235 |
| Trans_Totals | -0.168598 | -0.046446 | -0.024890 | 0.025046 | -0.007460 | 0.019651 | 0.196003 | -0.038591 | -0.347229 | -0.036982 | -0.112774 | 0.171730 | 0.064370 | 0.165923 | 0.039678 | 1.000000 | 0.807192 | 0.085581 | -0.083034 |
| Trans_Count | -0.371403 | -0.067097 | 0.067454 | 0.049912 | -0.004307 | -0.054569 | 0.134275 | -0.049819 | -0.241891 | -0.042787 | -0.152213 | 0.075927 | 0.056060 | 0.070885 | 0.005469 | 0.807192 | 1.000000 | 0.112324 | 0.002838 |
| Count_Changes | -0.290054 | -0.012143 | 0.005800 | 0.011087 | -0.016692 | -0.012657 | -0.007261 | -0.014072 | 0.040831 | -0.038989 | -0.094997 | -0.002020 | 0.089861 | -0.010076 | 0.384189 | 0.085581 | 0.112324 | 1.000000 | 0.074143 |
| Ratio | -0.178410 | 0.007114 | 0.257851 | -0.037135 | -0.001849 | -0.246476 | -0.198711 | -0.007541 | 0.067663 | -0.007503 | -0.055471 | -0.482965 | 0.624022 | -0.538808 | 0.035235 | -0.083034 | 0.002838 | 0.074143 | 1.000000 |
plt.figure(figsize=(35,20))
sns.heatmap(TB.corr(), annot=True)
plt.show()
# Pairplot using sns
sns.pairplot(TB)
<seaborn.axisgrid.PairGrid at 0x2a0f29edcd0>
Due to the numerous variables in the data set, we will first focus on the relationships between the dependent variable, Flag Attrition, and variables we deem important from initial EDA. Second, we will look at relationships of numerical vs numerical values that have moderate or strong correlations. Third, we will focus on numerical vs categorical variables of interest, and finally, we will compare categorical vs other categorical variables.
from numpy.polynomial.polynomial import polyfit
### Function to plot stacked bar charts for categorical columns
def stacked_plot(x):
sns.set(palette='nipy_spectral')
tab1 = pd.crosstab(x,TB['Flag'],margins=True)
print(tab1)
print('-'*120)
tab = pd.crosstab(x,TB['Flag'],normalize='index')
tab.plot(kind='bar',stacked=True,figsize=(10,5))
    plt.legend(loc="upper left", bbox_to_anchor=(1,1)) # place the legend outside the plot area
plt.show()
Dependent Variable vs Numerical Variables
Flag Attrition is the dependent variable and thus the most relevant to the model. We will look at relationships between this variable and other numerical variables first.
# Flag Attrition vs Contacts
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TB, x = 'Flag', hue = 'Contacts'); #should give a visual frequency graph
stacked_plot(TB['Contacts'])
Flag         0     1    All
Contacts
0          392     7    399
1         1391   108   1499
2         2824   403   3227
3         2699   681   3380
4         1077   315   1392
5          117    59    176
6            0    54     54
All       8500  1627  10127
------------------------------------------------------------------------------------------------------------------------
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Flag", y="Contacts", data=TB);
# Flag Attrition vs Months_Inactive
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TB, x = 'Flag', hue = 'Months_Inactive'); #should give a visual frequency graph
stacked_plot(TB['Months_Inactive'])
Flag                0     1    All
Months_Inactive
0                  14    15     29
1                2133   100   2233
2                2777   505   3282
3                3020   826   3846
4                 305   130    435
5                 146    32    178
6                 105    19    124
All              8500  1627  10127
------------------------------------------------------------------------------------------------------------------------
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Flag", y="Months_Inactive", data=TB);
# Flag Attrition vs Transaction Counts
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TB, x = 'Flag', hue = 'Trans_Count'); #should give a visual frequency graph
stacked_plot(TB['Trans_Count'])
Flag            0     1    All
Trans_Count
10              0     4      4
11              1     1      2
12              0     4      4
13              2     3      5
14              1     8      9
...           ...   ...    ...
132             1     0      1
134             1     0      1
138             1     0      1
139             1     0      1
All          8500  1627  10127

[127 rows x 3 columns]
------------------------------------------------------------------------------------------------------------------------
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Flag", y="Trans_Count", data=TB);
#Flag Attrition vs Count Changes
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Flag", y="Count_Changes", data=TB);
#Flag Attrition vs Balance
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Flag", y="Balance", data=TB);
# Flag Attrition vs Marital Status
figure = plt.figure(figsize=(25,5))
sns.countplot(data = TB, x = 'Flag', hue = 'Marital_Status'); #should give a visual frequency graph
stacked_plot(TB['Marital_Status'])
Flag               0     1    All
Marital_Status
Divorced         627   121    748
Married         3978   709   4687
Single          3275   668   3943
Unknown          620   129    749
All             8500  1627  10127
------------------------------------------------------------------------------------------------------------------------
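Crosstabs like the one above can also be tested formally. Since `chi2_contingency` is already imported at the top of the notebook, a chi-square test of independence between a categorical feature and Flag gives a p-value to back up the stacked bars. This is a sketch with toy data standing in for TB's columns:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy stand-ins for TB['Marital_Status'] and TB['Flag']
demo = pd.DataFrame({
    'Marital_Status': ['Married'] * 6 + ['Single'] * 6 + ['Divorced'] * 4,
    'Flag':           [0, 0, 0, 0, 1, 1,  0, 0, 0, 1, 1, 1,  0, 0, 1, 1],
})
# Build the contingency table and test whether the two variables are independent
table = pd.crosstab(demo['Marital_Status'], demo['Flag'])
chi2, p, dof, expected = chi2_contingency(table)
print(f'chi2={chi2:.3f}, p={p:.3f}, dof={dof}')
```

A small p-value would suggest attrition rates differ across marital-status groups; on the real TB data the counts are much larger, so the test is far better powered.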
Numerical vs Numerical Variables of Interest
#Trans Totals vs Trans Counts
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Trans_Totals", y="Trans_Count", data=TB);
#Age vs Months
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Age", y="Months", data=TB);
#Balance vs Ratio
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Balance", y="Ratio", data=TB);
Trans_Totals and Trans_Count have a strong positive correlation, as do Age and Months, and Balance and Ratio; Income and Gender have a strong negative correlation.
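Rather than reading every strong relationship off the heatmap by eye, the correlated pairs can be extracted programmatically. This is a sketch with synthetic data; `strong_pairs` is a hypothetical helper, and on the real data it would be called as `strong_pairs(TB)`.

```python
import numpy as np
import pandas as pd

# Sketch: list variable pairs whose absolute Pearson correlation exceeds a threshold
def strong_pairs(df, threshold=0.6):
    corr = df.corr()
    pairs = []
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:   # upper triangle only, skip self-pairs
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs

# Synthetic demo: 'a' and 'b' are built to be strongly correlated, 'c' is noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({'a': x,
                     'b': x + rng.normal(scale=0.1, size=200),
                     'c': rng.normal(size=200)})
print(strong_pairs(demo))
```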
Categorical vs Numerical Variables of Interest
The three variables that are of biggest concern to us, at the moment, are: Education, Income and Marital Status. These three variables have many unknown values that need to be cleaned. EDA may help us see patterns in the data so that we can make better choices when replacing the unknowns with imputed values.
Income
# Income vs Gender
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Income", y="Gender", data=TB);
sns.stripplot(data = TB, x = 'Income', y = 'Gender'); # this will graph income vs gender
# Income vs Credit Limit
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Income", y="Credit_Limit", data=TB);
We can start to see some patterns in the data. First, anyone with a credit limit over about 16k has an income of 1 or more; similarly, people with credit limits of around 25k or more have incomes of 2 or more. We also notice that ALL women have incomes of 0 or 1, while only men have incomes ranging from 0 to 4.
# Income vs Card
sns.stripplot(data = TB, x = 'Income', y = 'Card'); # this will graph income vs Card
# Income vs Products Held
sns.stripplot(data = TB, x = 'Income', y = 'Products_Held'); # this will graph income vs Products Held
# Income vs Contacts
sns.stripplot(data = TB, x = 'Income', y = 'Contacts'); # this will graph income vs Contact
# Income vs Trans Counts
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Income", y="Trans_Count", data=TB);
# Income vs Ratio
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Income", y="Ratio", data=TB);
The above graphs do not give us much more insight; income levels appear almost evenly spread across most of these other variables. We could have assumed as much from the lack of correlation between these variables, but it is good to double-check. Our best bet is to use Gender and Credit Limit to create a rule for replacing the missing Income values.
Education
# Education vs Gender
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Education", y="Gender", data=TB);
sns.stripplot(data = TB, x = 'Education', y = 'Gender'); # this will graph Education vs gender
# Education vs Credit Limit
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Education", y="Credit_Limit", data=TB);
# Education vs Card
sns.stripplot(data = TB, x = 'Education', y = 'Card'); # this will graph Education vs Card
# Education vs Products Held
sns.stripplot(data = TB, x = 'Education', y = 'Products_Held'); # this will graph Education vs Products Held
# Education vs Contacts
sns.stripplot(data = TB, x = 'Education', y = 'Contacts'); # this will graph Education vs Contact
# Education vs Trans Counts
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Education", y="Trans_Count", data=TB);
# Education vs Ratio
figure = plt.figure(figsize=(15,5)) # adding a trend line will give us a better understanding of correlation
sns.regplot(x="Education", y="Ratio", data=TB);
The above graphs give us little insight into discernible patterns. Customers of varying education levels seem evenly represented across the other variables, so it is difficult to find a pattern we could use to impute 'better' guesses for the unknown values. We may simply need to use the median or the mode of the entire column.
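Whole-column mode imputation can be sketched in a couple of lines. This toy example assumes Education encodes 'unknown' as -1, as elsewhere in this notebook; the `SimpleImputer` imported at the top (with `strategy='most_frequent'` and `missing_values=-1`) would do the same thing on TB.

```python
import pandas as pd

# Toy stand-in for TB['Education'], with -1 marking unknown values
demo = pd.DataFrame({'Education': [3, 1, 3, -1, 2, 3, -1]})

# Mode of the known values only, then substitute it for every unknown
mode = demo.loc[demo['Education'] != -1, 'Education'].mode()[0]
demo['Education'] = demo['Education'].replace(-1, mode)
print(demo['Education'].tolist())
```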
Marital Status
# Marital Status vs Gender
sns.stripplot(data = TB, x = 'Marital_Status', y = 'Gender'); # this will graph Marital_Status vs gender
# Marital_Status vs Credit Limit
sns.stripplot(data = TB, x = 'Marital_Status', y = 'Credit_Limit'); # this will graph Marital_Status vs Credit_Limit
# Marital_Status vs Card
sns.stripplot(data = TB, x = 'Marital_Status', y = 'Card'); # this will graph Marital_Status vs Card
# Marital_Status vs Products Held
sns.stripplot(data = TB, x = 'Marital_Status', y = 'Products_Held'); # this will graph Marital_Status vs Products Held
# Marital_Status vs Contacts
sns.stripplot(data = TB, x = 'Marital_Status', y = 'Contacts'); # this will graph Marital_Status vs Contact
# Marital_Status vs Trans Counts
sns.stripplot(data = TB, x = 'Marital_Status', y = 'Trans_Count'); # this will graph Marital_Status vs Trans_Count
# Marital_Status vs Ratio
sns.stripplot(data = TB, x = 'Marital_Status', y = 'Ratio'); # this will graph Marital_Status vs Ratio
It's difficult to be sure, but so far it appears that Unknown behaves most like the Divorced values; we can see similar shapes in the last two charts. Since this variable is unordered and categorical, we cannot replace unknowns with a mean, median, or mode. The best option would be to find clusters of data that act similarly and replace each unknown with the value of its most similar cluster. However, if all of the Unknown rows behave as a single cluster similar to Divorced, this would be an easy switch. We need to do more analysis to be sure.
df1 = TB.copy() # This creates a copy of the dataframe, which we can manipulate to include ONLY divorced and unknown
df1.shape
(10127, 20)
for index in range(0,len(df1)): # loop over every row index in the copy
    if df1['Marital_Status'].loc[index] == 'Married':
        df1.drop([index], axis=0, inplace=True)
# Slightly counter-intuitive: when the loop sees Marital_Status = 'Married', it deletes that row, leaving only non-married rows
df1.reset_index(inplace=True) # This code resets the index so we can perform a second loop
df1.shape
(5440, 21)
for index in range(0,len(df1)): # loop over every row index again after the reset
    if df1['Marital_Status'].loc[index] == 'Single':
        df1.drop([index], axis=0, inplace=True)
# Slightly counter-intuitive: when the loop sees Marital_Status = 'Single', it deletes that row, leaving only Divorced and Unknown rows
df1.reset_index(inplace=True) # This code resets the index so we can perform a second loop
df1.shape
(1497, 22)
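The two drop-loops above can be collapsed into a single vectorized filter, which is both faster and avoids the extra `level_0`/`index` columns created by the resets. A sketch with toy data standing in for TB:

```python
import pandas as pd

# Toy stand-in for TB's Marital_Status column
demo = pd.DataFrame({'Marital_Status': ['Married', 'Single', 'Divorced',
                                        'Unknown', 'Married', 'Unknown']})

# Keep only the Divorced and Unknown rows with one boolean mask
subset = demo[demo['Marital_Status'].isin(['Divorced', 'Unknown'])]
subset = subset.reset_index(drop=True)  # drop=True avoids adding an 'index' column
print(subset['Marital_Status'].value_counts().to_dict())
```

On the real data, `TB[TB['Marital_Status'].isin(['Divorced', 'Unknown'])]` would give the same 1,497-row subset in one step.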
# This will show us the 2 remaining unique values and their frequencies; we know from above that Unknown = 749 and Divorced = 748
df1["Marital_Status"].value_counts()
Unknown     749
Divorced    748
Name: Marital_Status, dtype: int64
We can now change the values to binary format, essentially asking the question, "do we know whether the customer is divorced?" 0 = no, we do not know; 1 = yes, we know they are divorced. This simple conversion will allow us to do more EDA and check things like correlations. If no other variable correlates with the Unknown/Divorced distinction, we can assume the two groups behave very similarly and can replace one with the other.
df1['Marital_Status'] = df1['Marital_Status'].map({'Unknown': 0, 'Divorced': 1,
})
#Changes the categorical marital status values to binary format
We can start by checking the correlation between the numerical data variables by using .corr and a heatmap function.
df1.drop(['level_0'], axis=1, inplace = True)
df1.drop(['index'], axis=1, inplace = True)
df1.corr() # creates a table of how the numerical values are correlated
| | Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flag | 1.000000 | 0.044995 | 0.022967 | 0.007866 | 0.025973 | -0.014029 | -0.015654 | -0.018335 | 0.030573 | -0.138428 | 0.174530 | 0.197877 | -0.033809 | -0.263806 | -0.011060 | -0.169101 | -0.196646 | -0.428582 | -0.332293 | -0.178939 |
| Age | 0.044995 | 1.000000 | 0.050217 | -0.023046 | -0.030082 | -0.031343 | 0.051894 | -0.032406 | 0.750799 | 0.022402 | 0.069599 | -0.018800 | -0.023736 | 0.027328 | -0.026077 | -0.018868 | -0.024960 | -0.051322 | -0.066768 | 0.039796 |
| Gender | 0.022967 | 0.050217 | 1.000000 | 0.017996 | 0.016763 | 0.030120 | -0.781016 | -0.075864 | 0.011659 | 0.010828 | 0.025717 | -0.037661 | -0.410813 | 0.045016 | -0.414444 | -0.005558 | -0.005492 | 0.085545 | 0.026816 | 0.290493 |
| Dependents | 0.007866 | -0.023046 | 0.017996 | 1.000000 | 0.022274 | -0.063103 | 0.041592 | -0.024958 | -0.020493 | 0.009142 | -0.016657 | -0.035110 | 0.008758 | 0.031478 | 0.006040 | -0.059394 | -0.004638 | -0.017449 | -0.001552 | 0.034029 |
| Education | 0.025973 | -0.030082 | 0.016763 | 0.022274 | 1.000000 | 0.017220 | -0.061747 | 0.018887 | 0.002906 | -0.009188 | 0.024914 | 0.042156 | -0.025408 | -0.011343 | -0.024415 | -0.043796 | 0.020058 | 0.025216 | -0.042107 | 0.014897 |
| Marital_Status | -0.014029 | -0.031343 | 0.030120 | -0.063103 | 0.017220 | 1.000000 | -0.032151 | -0.044161 | -0.002993 | 0.032488 | 0.011916 | 0.002104 | -0.004552 | -0.001076 | -0.004457 | 0.015709 | -0.028396 | -0.004936 | -0.029697 | 0.001967 |
| Income | -0.015654 | 0.051894 | -0.781016 | 0.041592 | -0.061747 | -0.032151 | 1.000000 | 0.041431 | 0.044119 | -0.023083 | -0.060840 | 0.009587 | 0.439427 | -0.027426 | 0.441525 | 0.015480 | 0.010597 | -0.078717 | 0.004408 | -0.249058 |
| Card | -0.018335 | -0.032406 | -0.075864 | -0.024958 | 0.018887 | -0.044161 | 0.041431 | 1.000000 | -0.025164 | -0.106850 | -0.056799 | -0.009313 | 0.483196 | 0.021227 | 0.481076 | 0.014840 | 0.149637 | 0.056470 | -0.032719 | -0.205858 |
| Months | 0.030573 | 0.750799 | 0.011659 | -0.020493 | 0.002906 | -0.002993 | 0.044119 | -0.025164 | 1.000000 | 0.021421 | 0.104106 | -0.035304 | -0.006756 | 0.025436 | -0.008944 | -0.024402 | -0.032055 | -0.046207 | -0.046117 | 0.006952 |
| Products_Held | -0.138428 | 0.022402 | 0.010828 | 0.009142 | -0.009188 | 0.032488 | -0.023083 | -0.106850 | 0.021421 | 1.000000 | -0.008546 | 0.037034 | -0.071790 | 0.010914 | -0.072687 | 0.027562 | -0.358310 | -0.214318 | 0.077532 | 0.094637 |
| Months_Inactive | 0.174530 | 0.069599 | 0.025717 | -0.016657 | 0.024914 | 0.011916 | -0.060840 | -0.056799 | 0.104106 | -0.008546 | 1.000000 | 0.043140 | -0.042176 | -0.068470 | -0.036251 | -0.059074 | -0.056727 | -0.067301 | -0.102526 | -0.043026 |
| Contacts | 0.197877 | -0.018800 | -0.037661 | -0.035110 | 0.042156 | 0.002104 | 0.009587 | -0.009313 | -0.035304 | 0.037034 | 0.043140 | 1.000000 | 0.017247 | -0.097506 | 0.025638 | -0.057720 | -0.089430 | -0.119225 | -0.106797 | -0.083254 |
| Credit_Limit | -0.033809 | -0.023736 | -0.410813 | 0.008758 | -0.025408 | -0.004552 | 0.439427 | 0.483196 | -0.006756 | -0.071790 | -0.042176 | 0.017247 | 1.000000 | 0.036113 | 0.996286 | 0.014476 | 0.111037 | -0.002744 | 0.003354 | -0.494247 |
| Balance | -0.263806 | 0.027328 | 0.045016 | 0.031478 | -0.011343 | -0.001076 | -0.027426 | 0.021227 | 0.025436 | 0.010914 | -0.068470 | -0.097506 | 0.036113 | 1.000000 | -0.050068 | 0.092343 | 0.129124 | 0.131853 | 0.110024 | 0.617419 |
| Ave_Credit_Line | -0.011060 | -0.026077 | -0.414444 | 0.006040 | -0.024415 | -0.004457 | 0.441525 | 0.481076 | -0.008944 | -0.072687 | -0.036251 | 0.025638 | 0.996286 | -0.050068 | 1.000000 | 0.006511 | 0.099845 | -0.014103 | -0.006127 | -0.547145 |
| Trans_Changes | -0.169101 | -0.018868 | -0.005558 | -0.059394 | -0.043796 | 0.015709 | 0.015480 | 0.014840 | -0.024402 | 0.027562 | -0.059074 | -0.057720 | 0.014476 | 0.092343 | 0.006511 | 1.000000 | 0.070287 | 0.038620 | 0.309230 | 0.065391 |
| Trans_Totals | -0.196646 | -0.024960 | -0.005492 | -0.004638 | 0.020058 | -0.028396 | 0.010597 | 0.149637 | -0.032055 | -0.358310 | -0.056727 | -0.089430 | 0.111037 | 0.129124 | 0.099845 | 0.070287 | 1.000000 | 0.792019 | 0.062769 | -0.014045 |
| Trans_Count | -0.428582 | -0.051322 | 0.085545 | -0.017449 | 0.025216 | -0.004936 | -0.078717 | 0.056470 | -0.046207 | -0.214318 | -0.067301 | -0.119225 | -0.002744 | 0.131853 | -0.014103 | 0.038620 | 0.792019 | 1.000000 | 0.093603 | 0.097293 |
| Count_Changes | -0.332293 | -0.066768 | 0.026816 | -0.001552 | -0.042107 | -0.029697 | 0.004408 | -0.032719 | -0.046117 | 0.077532 | -0.102526 | -0.106797 | 0.003354 | 0.110024 | -0.006127 | 0.309230 | 0.062769 | 0.093603 | 1.000000 | 0.087798 |
| Ratio | -0.178939 | 0.039796 | 0.290493 | 0.034029 | 0.014897 | 0.001967 | -0.249058 | -0.205858 | 0.006952 | 0.094637 | -0.043026 | -0.083254 | -0.494247 | 0.617419 | -0.547145 | 0.065391 | -0.014045 | 0.097293 | 0.087798 | 1.000000 |
There is almost no correlation between Marital_Status and any other variable. Let's perform some EDA on this dataframe.
# Divorced/Unknown vs Age
sns.stripplot(data = df1, x = 'Marital_Status', y = 'Age'); # this will graph Marital_Status vs Age
# Divorced/Unknown vs Gender
sns.stripplot(data = df1, x = 'Marital_Status', y = 'Gender'); # this will graph Marital_Status vs Gender
# Divorced/Unknown vs Dependents
sns.stripplot(data = df1, x = 'Marital_Status', y = 'Dependents'); # this will graph Marital_Status vs Dependents
# Divorced/Unknown vs Education
sns.stripplot(data = df1, x = 'Marital_Status', y = 'Education'); # this will graph Marital_Status vs Education
# Divorced/Unknown vs Months
sns.stripplot(data = df1, x = 'Marital_Status', y = 'Months'); # this will graph Marital_Status vs Months
# Divorced/Unknown vs Balance
sns.stripplot(data = df1, x = 'Marital_Status', y = 'Balance'); # this will graph Marital_Status vs Balance
The shapes of the above graphs seem so similar that it may not be problematic to replace all of the unknown values with 'Divorced'.
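The visual impression of similar shapes can be checked with a two-sample Kolmogorov-Smirnov test on a numeric column, using the `scipy.stats` import already at the top of the notebook. This is a hedged sketch: the toy samples below stand in for the Balance values of the Divorced and Unknown subsets of df1.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for df1 Balance values, split by Marital_Status group
rng = np.random.default_rng(42)
divorced = rng.normal(1200, 300, size=300)
unknown = rng.normal(1200, 300, size=300)

# KS test: the null hypothesis is that both samples come from the same distribution
stat, p = stats.ks_2samp(divorced, unknown)
print(f'KS stat={stat:.3f}, p={p:.3f}')  # a large p gives no evidence the distributions differ
```

If the real Divorced and Unknown groups also yield large p-values on the columns of interest, that supports treating Unknown as Divorced.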
Categorical vs Categorical Variables
Note: Some of these variables have been compared already after they were converted to ranked, numerical data. Here, we will compare the original variables from the raw dataframe, before any cleaning or preprocessing occurred.
#Gender vs Education_Level
sns.countplot(data = dataframe, x = 'Gender', hue = 'Education_Level')
<AxesSubplot:xlabel='Gender', ylabel='count'>
#Gender vs Marital Status
sns.countplot(data = dataframe, x = 'Gender', hue = 'Marital_Status')
<AxesSubplot:xlabel='Gender', ylabel='count'>
#Gender vs Income_Category
sns.countplot(data = dataframe, x = 'Gender', hue = 'Income_Category')
<AxesSubplot:xlabel='Gender', ylabel='count'>
#Gender vs Card_Category
sns.countplot(data = dataframe, x = 'Gender', hue = 'Card_Category')
<AxesSubplot:xlabel='Gender', ylabel='count'>
#Education_Level vs Marital Status
sns.countplot(data = dataframe, x = 'Education_Level', hue = 'Marital_Status')
<AxesSubplot:xlabel='Education_Level', ylabel='count'>
#Education_Level vs Income_Category
sns.countplot(data = dataframe, x = 'Education_Level', hue = 'Income_Category')
<AxesSubplot:xlabel='Education_Level', ylabel='count'>
#Education_Level vs Card_Category
sns.countplot(data = dataframe, x = 'Education_Level', hue = 'Card_Category')
<AxesSubplot:xlabel='Education_Level', ylabel='count'>
#Marital Status vs Income_Category
sns.countplot(data = dataframe, x = 'Marital_Status', hue = 'Income_Category')
<AxesSubplot:xlabel='Marital_Status', ylabel='count'>
#Marital Status vs Card_Category
sns.countplot(data = dataframe, x = 'Marital_Status', hue = 'Card_Category')
<AxesSubplot:xlabel='Marital_Status', ylabel='count'>
#Income_Category vs Card_Category
sns.countplot(data = dataframe, x = 'Income_Category', hue = 'Card_Category')
<AxesSubplot:xlabel='Income_Category', ylabel='count'>
TB2 = TB.copy() # at this point, we will make a copy of the dataset so we can deal with unknown values in other ways later on
We know that there are three columns with unknown values: Education, Marital Status, and Income. Let's see if we can impute some of these unknown values with group modes (since they are all categorical).
Income
Income is a good place to start because we know it is highly (negatively) correlated with Gender. There is also a moderate correlation with Credit Limit. We can use these two variables to help us predict the Income of customers where the value is unknown.
We know from EDA that, unfortunately, there are 0 instances of women with incomes in category 2, 3, or 4. In other words, every known female income is either below 40k or between 40k and 60k. We can double check...
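One quick way to double check is a Gender × Income crosstab, which makes visible at a glance which income codes occur for each gender. A sketch with toy data; on the real data this would be `pd.crosstab(TB['Gender'], TB['Income'])`.

```python
import pandas as pd

# Toy stand-ins: Gender 1 = female, 0 = male; Income is the ordinal category code
demo = pd.DataFrame({'Gender': [1, 1, 1, 0, 0, 0, 0],
                     'Income': [0, 1, 0, 2, 3, 4, 1]})

# Count of each income code per gender; zero cells show codes a gender never takes
ct = pd.crosstab(demo['Gender'], demo['Income'])
print(ct)
```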
We also know that there are no instances of income below 40k where the Credit Limit is greater than 16,000. Similarly, there are no instances where income is between 40k and 60k and the credit limit is greater than 24,000. We assume these are bank-imposed credit limits based on income, and we will maintain these limits while we attempt to classify the unknown income values. Let's check whether there are instances where Gender = Female, Income = Unknown, AND Credit Limit > 24,000.
TB.loc[(TB['Gender'] >= 1) & (TB['Income'] == -1) & (TB['Credit_Limit'] >= 24000)] #setting three conditions
| | Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101 | 0 | 41 | 1 | 3 | -1 | Married | -1 | 1 | 34 | 5 | 3 | 3 | 34516.0 | 2053 | 32463.0 | 1.034 | 1487 | 26 | 0.733 | 0.059 |
| 158 | 0 | 44 | 1 | 2 | 0 | Married | -1 | 1 | 35 | 4 | 3 | 2 | 32643.0 | 0 | 32643.0 | 1.300 | 1058 | 24 | 2.429 | 0.000 |
| 453 | 0 | 39 | 1 | 3 | 0 | Divorced | -1 | 0 | 29 | 5 | 1 | 2 | 26181.0 | 0 | 26181.0 | 0.913 | 1303 | 35 | 0.750 | 0.000 |
| 471 | 0 | 42 | 1 | 5 | 5 | Married | -1 | 0 | 23 | 4 | 2 | 2 | 24602.0 | 1660 | 22942.0 | 0.588 | 1337 | 44 | 0.760 | 0.067 |
| 650 | 0 | 51 | 1 | 3 | 4 | Divorced | -1 | 3 | 34 | 3 | 1 | 2 | 34516.0 | 1578 | 32938.0 | 0.725 | 1929 | 40 | 0.481 | 0.046 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9925 | 0 | 42 | 1 | 1 | 2 | Single | -1 | 1 | 31 | 6 | 2 | 3 | 30770.0 | 2228 | 28542.0 | 0.712 | 15998 | 121 | 0.704 | 0.072 |
| 9951 | 1 | 44 | 1 | 3 | -1 | Single | -1 | 0 | 34 | 2 | 3 | 3 | 26021.0 | 0 | 26021.0 | 1.040 | 8898 | 60 | 0.538 | 0.000 |
| 9967 | 0 | 39 | 1 | 3 | 3 | Unknown | -1 | 1 | 30 | 6 | 3 | 5 | 33905.0 | 2070 | 31835.0 | 0.685 | 15335 | 99 | 0.768 | 0.061 |
| 9968 | 1 | 51 | 1 | 1 | 3 | Married | -1 | 1 | 44 | 4 | 2 | 4 | 33004.0 | 2418 | 30586.0 | 0.923 | 10156 | 72 | 0.946 | 0.073 |
| 10000 | 0 | 47 | 1 | 5 | 5 | Single | -1 | 0 | 36 | 4 | 1 | 4 | 26923.0 | 2461 | 24462.0 | 0.602 | 13643 | 94 | 0.918 | 0.091 |
96 rows × 20 columns
We see 96 instances where women have credit limits above 24,000. Based on our assumption above, their incomes must be at least 60k per year. Therefore, at least one more income class for women will be created. Since the median income for women is 0 (below $40k), we will combine these two factors and assume these women have Income = 2.
np.median(TB[TB.Gender == 1].Income) # find the median income of just the women
0.0
np.median(TB[TB.Gender == 0].Income) # find the median income of just the men
2.0
Let's look at four examples of where Income = -1 (unknown). The first, row 94, is female with credit limit below 16000; the second, row 151, is male; the third row, 1925, is female with credit limit greater than 24000; finally, the 4th row, 7086, is female with credit limit between 16000 and 24000.
TB.loc[[94, 151, 1925, 7086], :]
| Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 94 | 0 | 45 | 1 | 3 | -1 | Married | -1 | 0 | 28 | 5 | 1 | 2 | 2535.0 | 2440 | 95.0 | 1.705 | 1312 | 20 | 1.222 | 0.963 |
| 151 | 0 | 68 | 0 | 1 | 3 | Married | -1 | 0 | 56 | 5 | 2 | 3 | 13860.0 | 1652 | 12208.0 | 1.255 | 1910 | 32 | 1.909 | 0.119 |
| 1925 | 0 | 62 | 1 | 0 | 3 | Single | -1 | 1 | 49 | 5 | 1 | 2 | 30310.0 | 0 | 30310.0 | 0.777 | 3703 | 91 | 0.492 | 0.000 |
| 7086 | 0 | 46 | 1 | 2 | 0 | Divorced | -1 | 0 | 40 | 5 | 2 | 3 | 20080.0 | 0 | 20080.0 | 0.560 | 4620 | 70 | 0.667 | 0.000 |
Ideally, we would replace the unknown incomes with 0 if the customer is female, or 2 if the customer is male. However, we assume that women with Credit Limits greater than 24k must have incomes of 2 or higher; since we cannot know whether they belong to class 2, 3, or 4, we will replace their -1 with 2, the class closest to the female group median (0). In short: Income = -1 & Gender = 0 (male) becomes the male group median, 2; Income = -1 & Gender = 1 (female) becomes 0, unless the Credit Limit is above 16000 and at most 24000 (replace with 1), or above 24000 (replace with 2).
TB["Income"].value_counts() # to remind us how many -1 values (the unknowns) there are
 0    3561
 1    1790
 3    1535
 2    1402
-1    1112
 4     727
Name: Income, dtype: int64
# Vectorized replacement of the unknown incomes (faster and avoids chained-assignment pitfalls)
unknown = TB['Income'] == -1
TB.loc[unknown & (TB['Gender'] == 0), 'Income'] = 2  # men: male group median
TB.loc[unknown & (TB['Gender'] == 1) & (TB['Credit_Limit'] <= 16000), 'Income'] = 0
TB.loc[unknown & (TB['Gender'] == 1) & (TB['Credit_Limit'] > 16000) & (TB['Credit_Limit'] <= 24000), 'Income'] = 1
TB.loc[unknown & (TB['Gender'] == 1) & (TB['Credit_Limit'] > 24000), 'Income'] = 2
TB["Income"].value_counts() # to confirm that no -1 values remain
0    4439
1    1876
2    1550
3    1535
4     727
Name: Income, dtype: int64
TB.loc[[94, 151, 1925, 7086], :]
| Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 94 | 0 | 45 | 1 | 3 | -1 | Married | 0 | 0 | 28 | 5 | 1 | 2 | 2535.0 | 2440 | 95.0 | 1.705 | 1312 | 20 | 1.222 | 0.963 |
| 151 | 0 | 68 | 0 | 1 | 3 | Married | 2 | 0 | 56 | 5 | 2 | 3 | 13860.0 | 1652 | 12208.0 | 1.255 | 1910 | 32 | 1.909 | 0.119 |
| 1925 | 0 | 62 | 1 | 0 | 3 | Single | 2 | 1 | 49 | 5 | 1 | 2 | 30310.0 | 0 | 30310.0 | 0.777 | 3703 | 91 | 0.492 | 0.000 |
| 7086 | 0 | 46 | 1 | 2 | 0 | Divorced | 1 | 0 | 40 | 5 | 2 | 3 | 20080.0 | 0 | 20080.0 | 0.560 | 4620 | 70 | 0.667 | 0.000 |
Re-examining rows 94, 151, 1925 and 7086, we should now see that the Income for row 151 = 2, the median grouped by gender (male). For rows 94, 1925, and 7086, the group median is 0 (female), but we have to ensure that the incomes are replaced with the closest value to zero that still respects the assumed Credit Limit maximums. Therefore, row 94 should have Income = 0, row 1925 should have Income = 2, and row 7086 should have Income = 1. Let's check...
sns.stripplot(data = TB, x = 'Income', y = 'Gender'); # this will graph income vs gender
sns.stripplot(data = TB, x = 'Income', y = 'Credit_Limit'); # this will graph income vs credit limit
We can see that incomes of 0 and 1 retained their Credit Limit caps around 16000 and 24000, respectively, while incomes of 2, 3 and 4 enjoy caps around 35000. A new income level for women was created, however: there are now women earning between 60k and 80k per year, putting them in the '2' category. We found no reason to disallow this; keeping the Credit Limit boundaries, which were probably pre-set by the bank, seemed more prudent.
Education
Education is a bit trickier than Income because it showed no correlations with any other variable. That does not mean there are no patterns, however. We can start by seeing whether gender plays a role in education, followed by age...
TB["Education"].value_counts() # to remind us how many -1 values (the unknowns) there are
 3    3128
 1    2013
-1    1519
 0    1487
 2    1013
 4     516
 5     451
Name: Education, dtype: int64
TB.groupby('Education')['Gender'].mean()
Education
-1    0.534562
 0    0.535306
 1    0.510681
 2    0.525173
 3    0.533887
 4    0.509690
 5    0.569845
Name: Gender, dtype: float64
The means are all slightly above 0.5, which matches the overall customer base: there are slightly more women (Gender = 1) than men. If one gender dominated a particular education level, we would expect that group's mean to sit much closer to 0 or 1. We do not see that, so gender does not appear to distinguish education levels.
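The same question can be framed as a chi-square test of independence between Gender and Education. A minimal sketch on a toy contingency table (the counts below are illustrative, not the notebook's; with the real data one would pass `pd.crosstab(TB['Education'], TB['Gender'])`):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table: rows = three education levels, columns = gender.
# Counts are illustrative only, chosen to mirror a near-uniform gender split.
crosstab = pd.DataFrame({'Female': [520, 480, 510], 'Male': [470, 455, 465]})
stat, p, dof, expected = chi2_contingency(crosstab)
print(p > 0.05)  # True: we cannot reject independence of gender and education
```

A large p-value here means the gender mix is statistically indistinguishable across education levels, matching the near-identical group means above.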
TB.groupby('Education')['Age'].mean()
Education
-1    46.428571
 0    46.423672
 1    46.345256
 2    45.970385
 3    46.323529
 4    45.562016
 5    47.261641
Name: Age, dtype: float64
Again, we are not seeing big discrepancies. It would be logical for older individuals to have more education, since higher levels of education take longer to acquire, but the data do not show it.
TB.groupby('Education')['Income'].mean()
Education
-1    1.238973
 0    1.225958
 1    1.269250
 2    1.262586
 3    1.203645
 4    1.265504
 5    1.179601
Name: Income, dtype: float64
np.median(TB.Education)
2.0
TB.Education.mode()
0 3 dtype: int64
We know from EDA visual graphing that there are far more 3s than 2s. We can also assume that the -1 values are dragging the median towards the left (closer to 0). Therefore, in this case, we will replace the unknown values with the mode of the column.
TB.loc[(TB['Education'] == -1)] # sets up a check to see if the -1s do get converted
| Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0 | 51 | 0 | 4 | -1 | Married | 4 | 2 | 46 | 6 | 1 | 3 | 34516.0 | 2264 | 32252.0 | 1.975 | 1330 | 31 | 0.722 | 0.066 |
| 11 | 0 | 65 | 0 | 1 | -1 | Married | 1 | 0 | 54 | 6 | 2 | 3 | 9095.0 | 1587 | 7508.0 | 1.433 | 1314 | 26 | 1.364 | 0.174 |
| 15 | 0 | 44 | 0 | 4 | -1 | Unknown | 3 | 0 | 37 | 5 | 1 | 2 | 4234.0 | 972 | 3262.0 | 1.707 | 1348 | 27 | 1.700 | 0.230 |
| 17 | 0 | 41 | 0 | 3 | -1 | Married | 3 | 0 | 34 | 4 | 4 | 1 | 13535.0 | 1291 | 12244.0 | 0.653 | 1028 | 21 | 1.625 | 0.095 |
| 23 | 0 | 47 | 1 | 4 | -1 | Single | 0 | 0 | 36 | 3 | 3 | 2 | 2492.0 | 1560 | 932.0 | 0.573 | 1126 | 23 | 0.353 | 0.626 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10090 | 0 | 36 | 1 | 3 | -1 | Married | 1 | 0 | 22 | 5 | 3 | 3 | 12958.0 | 2273 | 10685.0 | 0.608 | 15681 | 96 | 0.627 | 0.175 |
| 10094 | 0 | 59 | 0 | 1 | -1 | Single | 2 | 0 | 48 | 3 | 1 | 2 | 7288.0 | 0 | 7288.0 | 0.640 | 14873 | 120 | 0.714 | 0.000 |
| 10095 | 0 | 46 | 0 | 3 | -1 | Married | 3 | 0 | 33 | 4 | 1 | 3 | 34516.0 | 1099 | 33417.0 | 0.816 | 15490 | 110 | 0.618 | 0.032 |
| 10118 | 1 | 50 | 0 | 1 | -1 | Unknown | 3 | 0 | 36 | 6 | 3 | 4 | 9959.0 | 952 | 9007.0 | 0.825 | 10310 | 63 | 1.100 | 0.096 |
| 10123 | 1 | 41 | 0 | 2 | -1 | Divorced | 1 | 0 | 25 | 4 | 2 | 3 | 4277.0 | 2186 | 2091.0 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
1519 rows × 20 columns
TB['Education'].replace(-1, np.nan, inplace=True) # changes the -1s to NaN
TB['Education'].isnull().sum() # checks to see how many isnull values there are (there should be 1519 from above)
1519
TB['Education'] = TB['Education'].fillna(TB['Education'].mode()[0])
TB.loc[[15], :]
| Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 0 | 44 | 0 | 4 | 3.0 | Unknown | 3 | 0 | 37 | 5 | 1 | 2 | 4234.0 | 972 | 3262.0 | 1.707 | 1348 | 27 | 1.7 | 0.23 |
We can see that the -1 value for education in row 15 was changed to the column mode, or 3.
Marital Status
Marital Status is an unordered categorical variable, so we cannot perform correlation analysis on it. Instead, we need to look for patterns, as we did with Education. During bivariate/multivariate analysis, we discovered that the Unknown values behaved very similarly to the Divorced values across most other variables (i.e., they had very similar relationships). As a result, we will simply replace Unknown with Divorced.
TB["Marital_Status"].value_counts() # to remind us how many unknowns there are
Married     4687
Single      3943
Unknown      749
Divorced     748
Name: Marital_Status, dtype: int64
TB['Marital_Status'] = TB['Marital_Status'].replace({'Unknown': 'Divorced'})
# Leaves Single, Married and Divorced as-is, and converts Unknown to Divorced
TB["Marital_Status"].value_counts() # married should equal 4687, single should equal 3943 and divorced should equal 1497 (749+748)
Married     4687
Single      3943
Divorced    1497
Name: Marital_Status, dtype: int64
We can compare the previous graphs of separated Divorced and Unknown with graphs of the new combined variable to see if they are similar.
# Divorced/Unknown vs Balance
sns.stripplot(data = df1, x = 'Marital_Status', y = 'Balance'); # this will graph Marital_Status vs Balance
# Divorced(Combined) vs Balance
sns.stripplot(data = TB, x = 'Marital_Status', y = 'Balance'); # this will graph Marital_Status vs Balance
We know that Credit Limit and Average Credit Line are perfectly correlated. This means that the two will work in tandem to pull the model in the same direction, a phenomenon called multicollinearity. In principle, since the two variables tell us the same thing, it is prudent to drop one of the columns so that their effect is not amplified. Let's drop the Average Credit Line column...
TB.drop('Ave_Credit_Line',axis=1,inplace=True) # this drops the column since it is perfectly correlated with another column and will cause multicollinearity
Splitting the dataframe into categorical columns so that we can evaluate the correlation of categorical data using Cramér's V.
TB_categorical = dataframe[['Gender', 'Education_Level', 'Marital_Status', 'Income_Category','Card_Category','Attrition_Flag']]
TB_categorical.head()
| Gender | Education_Level | Marital_Status | Income_Category | Card_Category | Attrition_Flag | |
|---|---|---|---|---|---|---|
| 0 | M | High School | Married | $60K - $80K | Blue | Existing Customer |
| 1 | F | Graduate | Single | Less than $40K | Blue | Existing Customer |
| 2 | M | Graduate | Married | $80K - $120K | Blue | Existing Customer |
| 3 | F | High School | Unknown | Less than $40K | Blue | Existing Customer |
| 4 | M | Uneducated | Married | $60K - $80K | Blue | Existing Customer |
label = preprocessing.LabelEncoder()
TB_categorical_encoded = pd.DataFrame()
for i in TB_categorical.columns:
    TB_categorical_encoded[i] = label.fit_transform(TB_categorical[i])
def cramers_V(var1, var2):
    crosstab = np.array(pd.crosstab(var1, var2, rownames=None, colnames=None))  # build the cross table
    stat = chi2_contingency(crosstab)[0]  # keep the Chi2 test statistic
    obs = np.sum(crosstab)  # number of observations
    mini = min(crosstab.shape) - 1  # minimum of the cross table's dimensions, minus 1
    return np.sqrt(stat / (obs * mini))  # Cramér's V is the square root of chi2 / (n * min(r-1, c-1))
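As a sanity check on the statistic itself, a variable crossed with itself should give a Cramér's V of exactly 1. A self-contained sketch (`cramers_v_demo` is a local restatement of the function, with the square root that the conventional definition includes):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_demo(var1, var2):
    crosstab = np.array(pd.crosstab(var1, var2))
    stat = chi2_contingency(crosstab)[0]   # Chi2 test statistic
    obs = np.sum(crosstab)                 # number of observations
    mini = min(crosstab.shape) - 1         # min dimension of the table, minus 1
    return np.sqrt(stat / (obs * mini))    # Cramér's V

x = pd.Series(['a', 'b', 'c'] * 50)
print(cramers_v_demo(x, x))  # a variable crossed with itself gives 1.0
```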
rows = []
for var1 in TB_categorical_encoded:
    col = []
    for var2 in TB_categorical_encoded:
        cramers = cramers_V(TB_categorical_encoded[var1], TB_categorical_encoded[var2])  # Cramér's V test
        col.append(round(cramers, 2))  # keep the rounded value of the Cramér's V
    rows.append(col)
cramers_results = np.array(rows)
cramerv_matrix = pd.DataFrame(cramers_results, columns = TB_categorical_encoded.columns, index =TB_categorical_encoded.columns)
mask = np.triu(np.ones_like(cramerv_matrix, dtype=bool))  # np.bool is deprecated; use the builtin bool
cat_heatmap = sns.heatmap(cramerv_matrix, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
cat_heatmap.set_title('Categorical Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
The above heatmap shows that the categorical columns are not, by themselves, correlated with customer churn.
TB_numerical = dataframe.drop(
    columns=['CLIENTNUM', 'Gender', 'Education_Level', 'Marital_Status',
             'Income_Category', 'Card_Category'])  # .drop returns a copy, leaving 'dataframe' untouched
TB_numerical.head()
| Attrition_Flag | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | 3 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | 5 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | 3 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | 4 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | 3 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
TB_numerical = pd.get_dummies(TB_numerical, prefix=['Attrition_Flag'], columns=['Attrition_Flag'])
num_corr=TB_numerical.corr()
plt.figure(figsize=(16, 6))
mask = np.triu(np.ones_like(num_corr, dtype=bool))  # np.bool is deprecated; use the builtin bool
num_heatmap = sns.heatmap(num_corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
num_heatmap.set_title('Numerical Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
The above heatmap allows us to see correlations between the numerical data and churn. We can rank them using the following code...
fig, ax=plt.subplots(ncols=2,figsize=(15, 5))
heatmap = sns.heatmap(num_corr[['Attrition_Flag_Existing Customer']].sort_values(by='Attrition_Flag_Existing Customer', ascending=False), ax=ax[0],vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with Existing Customers', fontdict={'fontsize':18}, pad=16);
heatmap = sns.heatmap(num_corr[['Attrition_Flag_Attrited Customer']].sort_values(by='Attrition_Flag_Attrited Customer', ascending=False), ax=ax[1],vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with Attrited Customers', fontdict={'fontsize':18}, pad=16);
fig.tight_layout(pad=5)
We have previously defined 'no-correlation' as values between +0.1 and -0.1. As such, we can see that there is no correlation between churn and the following variables: Age, Dependents, Months, and Credit_Limit.
Since there is no correlation between these variables and the dependent variable, we can remove them from the model. We have already dropped the Average Open to Buy column (renamed Ave_Credit_Line) above due to multicollinearity.
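The same ±0.1 threshold can be applied programmatically rather than read off the heatmap by eye. A minimal sketch on a hypothetical correlation series (the notebook itself would filter `num_corr['Attrition_Flag_Existing Customer']`):

```python
import pandas as pd

# Hypothetical correlations with the churn target (illustrative values only)
corr = pd.Series({'Age': 0.02, 'Dependents': -0.05, 'Months': 0.01,
                  'Credit_Limit': 0.04, 'Trans_Count': 0.54, 'Contacts': -0.21})
no_corr = corr[corr.abs() < 0.1].index.tolist()  # the +/-0.1 'no correlation' band
print(no_corr)  # ['Age', 'Dependents', 'Months', 'Credit_Limit']
```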
TB.drop('Age',axis=1,inplace=True) # this drops the columns that show no correlation with the dependent variable
TB.drop('Dependents',axis=1,inplace=True)
TB.drop('Months',axis=1,inplace=True)
TB.drop('Credit_Limit',axis=1,inplace=True)
TB.head()
| Flag | Gender | Education | Marital_Status | Income | Card | Products_Held | Months_Inactive | Contacts | Balance | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1.0 | Married | 2 | 0 | 5 | 1 | 3 | 777 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 0 | 1 | 3.0 | Single | 0 | 0 | 6 | 1 | 2 | 864 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 0 | 0 | 3.0 | Married | 3 | 0 | 4 | 1 | 0 | 0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 0 | 1 | 1.0 | Divorced | 0 | 0 | 3 | 4 | 1 | 2517 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 0 | 0 | 0.0 | Married | 2 | 0 | 5 | 1 | 0 | 0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
Our next step in data preprocessing is to convert the remaining categorical column to binary. We have already encoded the Education, Income and Card columns numerically because we wanted to retain their ranked, hierarchical structure. Marital Status has no hierarchy, so that was not possible; instead, we will convert it into three indicator columns: Married, Single and Divorced. We will then drop one of the three because it is redundant: if a customer scores 0 in both Married and Single, they must score 1 in Divorced, so keeping all three would only re-encode the same information (the 'dummy variable trap').
TB["Marital_Status"].value_counts()
Married     4687
Single      3943
Divorced    1497
Name: Marital_Status, dtype: int64
As we alluded to above, we will drop the Divorced column since it has the fewest customers. We must also drop the original Marital_Status column so that this information is not counted twice...
TB = pd.concat([TB,pd.get_dummies(TB['Marital_Status']).drop(columns=['Divorced'])],axis=1)
TB.drop('Marital_Status',axis=1,inplace=True)
TB.head() #checks to see if two new columns, "Married" and "Single" are added while 'Marital_Status' is dropped
| Flag | Gender | Education | Income | Card | Products_Held | Months_Inactive | Contacts | Balance | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | Married | Single | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1.0 | 2 | 0 | 5 | 1 | 3 | 777 | 1.335 | 1144 | 42 | 1.625 | 0.061 | 1 | 0 |
| 1 | 0 | 1 | 3.0 | 0 | 0 | 6 | 1 | 2 | 864 | 1.541 | 1291 | 33 | 3.714 | 0.105 | 0 | 1 |
| 2 | 0 | 0 | 3.0 | 3 | 0 | 4 | 1 | 0 | 0 | 2.594 | 1887 | 20 | 2.333 | 0.000 | 1 | 0 |
| 3 | 0 | 1 | 1.0 | 0 | 0 | 3 | 4 | 1 | 2517 | 1.405 | 1171 | 20 | 2.333 | 0.760 | 0 | 0 |
| 4 | 0 | 0 | 0.0 | 2 | 0 | 5 | 1 | 0 | 0 | 2.175 | 816 | 28 | 2.500 | 0.000 | 1 | 0 |
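The concat-and-drop pattern above can also be expressed in a single call via `pd.get_dummies(..., drop_first=True)`. A toy sketch (note that `drop_first` drops the first category alphabetically, which for this column happens to be Divorced, matching the manual drop):

```python
import pandas as pd

toy = pd.DataFrame({'Marital_Status': ['Married', 'Single', 'Divorced', 'Married']})
# drop_first=True drops the first category alphabetically -- 'Divorced' here
dummies = pd.get_dummies(toy['Marital_Status'], drop_first=True)
print(list(dummies.columns))  # ['Married', 'Single']
```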
In an effort to select the very best model possible, we will have to construct various individual models and apply different ensemble techniques, cross-validation methods and model tuning procedures. Ultimately, we will conduct 4 different approaches in our effort to select the best model for the Bank's goals.
x = TB.drop(["Flag"], axis=1)
y = TB["Flag"]
# Scale all the columns of the x dataframe. This will produce a numpy array
X_scaled = preprocessing.scale(x)
X_scaled = pd.DataFrame(X_scaled, columns=x.columns)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.30, random_state=1)
y.value_counts(1)
0    0.83934
1    0.16066
Name: Flag, dtype: float64
y_train.value_counts(1)
0    0.839306
1    0.160694
Name: Flag, dtype: float64
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
from sklearn import metrics  # not imported above; needed for recall_score and precision_score

def get_metrics_score(model, train, test, train_y, test_y, flag=True):
    '''
    model : classifier to predict values of X
    '''
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(train)
    pred_test = model.predict(test)
    train_acc = model.score(train, train_y)
    test_acc = model.score(test, test_y)
    train_recall = metrics.recall_score(train_y, pred_train)
    test_recall = metrics.recall_score(test_y, pred_test)
    train_precision = metrics.precision_score(train_y, pred_train)
    test_precision = metrics.precision_score(test_y, pred_test)
    score_list.extend((train_acc, test_acc, train_recall, test_recall,
                       train_precision, test_precision))
    # The print statements below are shown only when flag is True (the default)
    if flag == True:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return score_list  # returning the list with train and test scores
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual):
    '''
    model : classifier to predict values of X
    y_actual : ground truth
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
from sklearn.linear_model import LogisticRegression  # not imported above

lr = LogisticRegression(random_state=1)
# Training the basic logistic regression model with training set
lr.fit(X_train,y_train)
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, lr.coef_[0][idx]))
The coefficient for Gender is 0.5051459064982164
The coefficient for Education is 0.02332358220784528
The coefficient for Income is 0.24646547483083936
The coefficient for Card is 0.12314683477732635
The coefficient for Products_Held is -0.6906475167859727
The coefficient for Months_Inactive is 0.5112620107887936
The coefficient for Contacts is 0.5400444679778014
The coefficient for Balance is -0.8202728096947371
The coefficient for Trans_Changes is -0.05781014035407213
The coefficient for Trans_Totals is 1.598374818122629
The coefficient for Trans_Count is -2.7392145697676207
The coefficient for Count_Changes is -0.6987322352750124
The coefficient for Ratio is 0.07654562403404999
The coefficient for Married is -0.3071411714042533
The coefficient for Single is 0.03591866364873732
intercept = lr.intercept_[0]
print("The intercept for our model is {}".format(intercept))
The intercept for our model is -3.0173145318247103
Let's evaluate the model performance by using KFold and cross_val_score
K-Folds cross-validation provides dataset indices to split the data into train/validation sets. The dataset is split into k consecutive stratified folds (without shuffling by default); each fold is then used once as validation while the remaining k - 1 folds form the training set.
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_bfr=cross_val_score(estimator=lr, X=X_train, y=y_train, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_bfr)
plt.show()
#Calculating different metrics
scores_LR = get_metrics_score(lr,X_train,X_test,y_train,y_test)
# creating confusion matrix
make_confusion_matrix(lr,y_test)
Accuracy on training set :  0.907872460496614
Accuracy on test set :  0.9016123724909509
Recall on training set :  0.6022827041264267
Recall on test set :  0.5655737704918032
Precision on training set :  0.7742663656884876
Precision on test set :  0.7603305785123967
We will attempt to improve the model's performance further by:
a) Oversampling - synthesizing more data points for the minority class.
b) Undersampling - reducing the majority class so the two classes are balanced.
print("Before UpSampling, counts of label 'Yes': {}".format(sum(y_train==1)))
print("Before UpSampling, counts of label 'No': {} \n".format(sum(y_train==0)))
from imblearn.over_sampling import SMOTE  # not imported above

sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1) #Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After UpSampling, counts of label 'Yes': {}".format(sum(y_train_over==1)))
print("After UpSampling, counts of label 'No': {} \n".format(sum(y_train_over==0)))
print('After UpSampling, the shape of train_X: {}'.format(X_train_over.shape))
print('After UpSampling, the shape of train_y: {} \n'.format(y_train_over.shape))
Before UpSampling, counts of label 'Yes': 1139
Before UpSampling, counts of label 'No': 5949

After UpSampling, counts of label 'Yes': 5949
After UpSampling, counts of label 'No': 5949

After UpSampling, the shape of train_X: (11898, 15)
After UpSampling, the shape of train_y: (11898,)
log_reg_over = LogisticRegression(random_state = 1)
# Training the oversampled logistic regression model with training set
log_reg_over.fit(X_train_over,y_train_over)
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, log_reg_over.coef_[0][idx]))
The coefficient for Gender is 0.5550976634208064
The coefficient for Education is -0.009578492288227327
The coefficient for Income is 0.2567222756978001
The coefficient for Card is 0.11430309081156968
The coefficient for Products_Held is -0.6105537999924109
The coefficient for Months_Inactive is 0.5969902231126034
The coefficient for Contacts is 0.5842148798429835
The coefficient for Balance is -0.7007504512324066
The coefficient for Trans_Changes is -0.10132543227102565
The coefficient for Trans_Totals is 2.0131021689042723
The coefficient for Trans_Count is -3.269376382956682
The coefficient for Count_Changes is -0.7338629573911205
The coefficient for Ratio is 0.011548600280648948
The coefficient for Married is -0.2405978503287487
The coefficient for Single is 0.08073476674522646
intercept = log_reg_over.intercept_[0]
print("The intercept for our model is {}".format(intercept))
The intercept for our model is -1.5930526004266676
We can evaluate the model performance by using KFold and cross_val_score...
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_over=cross_val_score(estimator=log_reg_over, X=X_train_over, y=y_train_over, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_over)
plt.show()
#Calculating different metrics
get_metrics_score(log_reg_over,X_train_over,X_test,y_train_over,y_test)
# creating confusion matrix
make_confusion_matrix(log_reg_over,y_test)
Accuracy on training set :  0.8571188435031097
Accuracy on test set :  0.8525830865416255
Recall on training set :  0.8636745671541436
Recall on test set :  0.8381147540983607
Precision on training set :  0.8524970963995354
Precision on test set :  0.525706940874036
from imblearn.under_sampling import RandomUnderSampler  # not imported above

rus = RandomUnderSampler(random_state = 1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train==1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train==0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un==1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un==0)))
print('After Under Sampling, the shape of train_X: {}'.format(X_train_un.shape))
print('After Under Sampling, the shape of train_y: {} \n'.format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 1139
Before Under Sampling, counts of label 'No': 5949

After Under Sampling, counts of label 'Yes': 1139
After Under Sampling, counts of label 'No': 1139

After Under Sampling, the shape of train_X: (2278, 15)
After Under Sampling, the shape of train_y: (2278,)
log_reg_under = LogisticRegression(random_state = 1)
# Training the undersampled logistic regression model with training set
log_reg_under.fit(X_train_un,y_train_un )
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, log_reg_under.coef_[0][idx]))
The coefficient for Gender is 0.3931042083431588
The coefficient for Education is 0.025667024578809552
The coefficient for Income is 0.12026535981601418
The coefficient for Card is 0.12671767886643143
The coefficient for Products_Held is -0.5432415466306302
The coefficient for Months_Inactive is 0.5209138130609903
The coefficient for Contacts is 0.5608219101761733
The coefficient for Balance is -0.6190012162729854
The coefficient for Trans_Changes is -0.16232523527407994
The coefficient for Trans_Totals is 1.6064406429152867
The coefficient for Trans_Count is -2.7650055114858354
The coefficient for Count_Changes is -0.5619140877918258
The coefficient for Ratio is -0.026442099161787646
The coefficient for Married is -0.2842695604428667
The coefficient for Single is 0.045046112024145143
intercept = log_reg_under.intercept_[0]
print("The intercept for our model is {}".format(intercept))
The intercept for our model is -1.383621107962739
We can evaluate the model performance by using KFold and cross_val_score...
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1) #Setting number of splits equal to 5
cv_result_under=cross_val_score(estimator=log_reg_under, X=X_train_un, y=y_train_un, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_under)
plt.show()
#Calculating different metrics
get_metrics_score(log_reg_under,X_train_un,X_test,y_train_un,y_test)
# creating confusion matrix
make_confusion_matrix(log_reg_under,y_test)
Accuracy on training set :  0.8397717295873574
Accuracy on test set :  0.8545574202040145
Recall on training set :  0.8419666374012291
Recall on test set :  0.8299180327868853
Precision on training set :  0.8382867132867133
Precision on test set :  0.5301047120418848
# Choose the type of classifier.
from sklearn.model_selection import GridSearchCV  # not imported above

lr_estimator = LogisticRegression(random_state=1,solver='saga')
# Grid of parameters to choose from
parameters = {'C': np.arange(0.1,1.1,0.1)}
# Run the grid search
grid_obj = GridSearchCV(lr_estimator, parameters, scoring='recall')
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
lr_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
lr_estimator.fit(X_train, y_train)
LogisticRegression(C=0.6, random_state=1, solver='saga')
#Calculating different metrics
get_metrics_score(lr_estimator,X_train,X_test,y_train,y_test)
# creating confusion matrix
make_confusion_matrix(lr_estimator,y_test)
Accuracy on training set : 0.9081546275395034
Accuracy on test set : 0.9012833168805529
Recall on training set : 0.6014047410008779
Recall on test set : 0.5635245901639344
Precision on training set : 0.7766439909297053
Precision on test set : 0.7596685082872928
Regularized RIDGE Model
ridge = Ridge(alpha=.01)
ridge.fit(X_train,y_train)
print ("Ridge model:", (ridge.coef_))
Ridge model: [ 0.04492229 0.00235217 0.02098191 0.00809855 -0.0654128 0.04086643 0.04370765 -0.08229539 -0.00956574 0.12278804 -0.23667012 -0.07227584 0.00751409 -0.02540581 0.00235008]
Regularized LASSO Model
lasso = Lasso(alpha=0.005)
lasso.fit(X_train,y_train)
print ("Lasso model:", (lasso.coef_))
# Observe, many of the coefficients have become 0 indicating drop of those dimensions from the model
Lasso model: [ 0.02279482 0. 0. 0.00410982 -0.06383115 0.036566 0.03973016 -0.0732561 -0.00502319 0.09431177 -0.20794825 -0.07094405 -0. -0.02099102 0. ]
Comparing the Scores
print(lr.score(X_train, y_train))
print(lr.score(X_test, y_test))
0.907872460496614 0.9016123724909509
print(ridge.score(X_train, y_train))
print(ridge.score(X_test, y_test))
0.3726100628758876 0.3716454834758194
print(lasso.score(X_train, y_train))
print(lasso.score(X_test, y_test))
0.3679244184419942 0.3694626546272921
Both Ridge and Lasso reduce the complexity of the model, but at a high cost: the scores have dropped substantially. Note that Ridge and Lasso are regressors, so their .score() reports R² rather than accuracy or recall, which makes them ill-suited to this classification task; regularizing the logistic regression itself (via its C parameter, as above) is the more appropriate route.
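For classification, the analogue of Ridge/Lasso shrinkage lives inside LogisticRegression itself: C is the inverse of the regularization strength, and penalty='l1' with the saga solver zeroes coefficients the way Lasso does. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# Smaller C = stronger penalty (C is the inverse of alpha)
weak = LogisticRegression(C=1.0, solver='saga', max_iter=5000,
                          random_state=1).fit(X, y)
strong = LogisticRegression(C=0.01, solver='saga', max_iter=5000,
                            random_state=1).fit(X, y)
print(np.linalg.norm(weak.coef_), np.linalg.norm(strong.coef_))

# An L1 penalty drops features outright, like Lasso
sparse = LogisticRegression(C=0.05, penalty='l1', solver='saga',
                            max_iter=5000, random_state=1).fit(X, y)
print(int((sparse.coef_ == 0).sum()), "coefficients zeroed")
```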
# Choose the type of classifier.
lr_estimator1 = LogisticRegression(random_state=1,solver='saga')
# Grid of parameters to choose from
parameters = {'C': np.arange(0.1,1.1,0.1)}
# Run the grid search
grid_obj = GridSearchCV(lr_estimator1, parameters, scoring='recall')
grid_obj = grid_obj.fit(X_train_over, y_train_over)
# Set the clf to the best combination of parameters
lr_estimator1 = grid_obj.best_estimator_
# Fit the best algorithm to the data.
lr_estimator1.fit(X_train_over, y_train_over)
LogisticRegression(C=0.4, random_state=1, solver='saga')
#Calculating different metrics
get_metrics_score(lr_estimator1,X_train_over,X_test,y_train_over,y_test)
# creating confusion matrix
make_confusion_matrix(lr_estimator1,y_test)
Accuracy on training set : 0.8573709867204572
Accuracy on test set : 0.8532411977624218
Recall on training set : 0.8640107581106068
Recall on test set : 0.8381147540983607
Precision on training set : 0.8526874585268746
Precision on test set : 0.5270618556701031
Regularized RIDGE Model
ridge = Ridge(alpha=.01)
ridge.fit(X_train_over,y_train_over)
print ("Ridge model:", (ridge.coef_))
Ridge model: [ 6.54898071e-02 -1.18020906e-04 2.84322251e-02 1.49042371e-02 -6.72730382e-02 6.66130260e-02 6.60972719e-02 -9.62105743e-02 -1.09448389e-02 2.40369787e-01 -4.06007892e-01 -9.35919796e-02 9.55794377e-03 -2.71550921e-02 1.29320376e-02]
Regularized LASSO Model
lasso = Lasso(alpha=0.005)
lasso.fit(X_train_over,y_train_over)
print ("Lasso model:", (lasso.coef_))
# Observe, many of the coefficients have become 0 indicating drop of those dimensions from the model
Lasso model: [ 0.03686427 0. 0. 0.01194112 -0.06757141 0.06264937 0.06280585 -0.08779564 -0.00496682 0.20477449 -0.37456175 -0.09415545 -0. -0.02261419 0.01151724]
Comparing the Scores
print(log_reg_over.score(X_train_over, y_train_over))
print(log_reg_over.score(X_test, y_test))
0.8571188435031097 0.8525830865416255
print(ridge.score(X_train_over, y_train_over))
print(ridge.score(X_test, y_test))
0.540511103032892 0.10811655417849364
print(lasso.score(X_train_over, y_train_over))
print(lasso.score(X_test, y_test))
0.5375677636450668 0.11509734091917045
Both Ridge and Lasso reduce the complexity of the model, but at a high cost: the scores have dropped substantially, especially on the test data. Keep in mind these are R² values from regressors, not classification recall, so they are not directly comparable to the logistic regression results.
# Choose the type of classifier.
lr_estimator2 = LogisticRegression(random_state=1,solver='saga')
# Grid of parameters to choose from
parameters = {'C': np.arange(0.1,1.1,0.1)}
# Run the grid search
grid_obj = GridSearchCV(lr_estimator2, parameters, scoring='recall')
grid_obj = grid_obj.fit(X_train_un, y_train_un)
# Set the clf to the best combination of parameters
lr_estimator2 = grid_obj.best_estimator_
# Fit the best algorithm to the data.
lr_estimator2.fit(X_train_un, y_train_un)
LogisticRegression(C=0.1, random_state=1, solver='saga')
#Calculating different metrics
get_metrics_score(lr_estimator2,X_train_un,X_test,y_train_un,y_test)
# creating confusion matrix
make_confusion_matrix(lr_estimator2,y_test)
Accuracy on training set : 0.8397717295873574
Accuracy on test set : 0.8575189206975979
Recall on training set : 0.8419666374012291
Recall on test set : 0.8340163934426229
Precision on training set : 0.8382867132867133
Precision on test set : 0.5362318840579711
Regularized RIDGE Model
ridge = Ridge(alpha=.01)
ridge.fit(X_train_un,y_train_un)
print ("Ridge model:", (ridge.coef_))
Ridge model: [ 5.49954282e-02 4.99184670e-03 1.79396550e-02 1.29715890e-02 -6.64820805e-02 6.39893851e-02 6.90209930e-02 -8.82491807e-02 -2.26784698e-02 2.19450838e-01 -3.78266858e-01 -7.93443177e-02 -2.23339550e-04 -3.76716998e-02 7.62918616e-03]
Regularized LASSO Model
lasso = Lasso(alpha=0.005)
lasso.fit(X_train_un,y_train_un)
print ("Lasso model:", (lasso.coef_))
# Observe, many of the coefficients have become 0 indicating drop of those dimensions from the model
Lasso model: [ 3.31973762e-02 9.95660140e-05 0.00000000e+00 1.11835724e-02 -6.62235643e-02 6.04047318e-02 6.61282370e-02 -8.58638493e-02 -1.67021409e-02 1.86456317e-01 -3.49542522e-01 -7.95122616e-02 -1.38746189e-03 -3.41027306e-02 5.05472336e-03]
Comparing the Scores
print(log_reg_under.score(X_train_un, y_train_un))
print(log_reg_under.score(X_test, y_test))
0.8397717295873574 0.8545574202040145
print(ridge.score(X_train_un, y_train_un))
print(ridge.score(X_test, y_test))
0.5074673487081532 0.12816227887166132
print(lasso.score(X_train_un, y_train_un))
print(lasso.score(X_test, y_test))
0.5050684462894106 0.12925482310787018
Both Ridge and Lasso reduce the complexity of the model, but again at a high cost: the scores drop substantially, especially on the test data. As before, these are R² values from regressors, not classification recall.
# defining list of model
models = [lr, lr_estimator]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, X_train, X_test, y_train, y_test, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
# defining list of models
models = [log_reg_over, lr_estimator1]
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, X_train_over, X_test, y_train_over, y_test, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
# defining list of model
models = [log_reg_under, lr_estimator2]
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, X_train_un, X_test, y_train_un, y_test, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
comparison_frame0 = pd.DataFrame({'Model':['Logistic Regression', 'Logistic Regression-Regularized', 'Logistic Regression on Oversampled data',
'Logistic Regression-Regularized (Oversampled data)','Logistic Regression on Undersampled data', 'Logistic Regression-Regularized (Undersampled data)'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test})
#Sorting models in decreasing order of test recall
comparison_frame0.sort_values(by='Test_Recall',ascending=False)
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision |
|---|---|---|---|---|---|---|---|
| 2 | Logistic Regression on Oversampled data | 0.857119 | 0.852583 | 0.863675 | 0.838115 | 0.852497 | 0.525707 |
| 3 | Logistic Regression-Regularized (Oversampled d... | 0.857371 | 0.853241 | 0.864011 | 0.838115 | 0.852687 | 0.527062 |
| 5 | Logistic Regression-Regularized (Undersampled ... | 0.839772 | 0.857519 | 0.841967 | 0.834016 | 0.838287 | 0.536232 |
| 4 | Logistic Regression on Undersampled data | 0.839772 | 0.854557 | 0.841967 | 0.829918 | 0.838287 | 0.530105 |
| 0 | Logistic Regression | 0.907872 | 0.901612 | 0.602283 | 0.565574 | 0.774266 | 0.760331 |
| 1 | Logistic Regression-Regularized | 0.908155 | 0.901283 | 0.601405 | 0.563525 | 0.776644 | 0.759669 |
log_odds = lr_estimator1.coef_[0]
pd.DataFrame(log_odds, X_train_over.columns, columns=['coef'])
| | coef |
|---|---|
| Gender | 0.547892 |
| Education | -0.009630 |
| Income | 0.251814 |
| Card | 0.113932 |
| Products_Held | -0.608575 |
| Months_Inactive | 0.593024 |
| Contacts | 0.581052 |
| Balance | -0.695118 |
| Trans_Changes | -0.099495 |
| Trans_Totals | 1.980312 |
| Trans_Count | -3.229856 |
| Count_Changes | -0.731847 |
| Ratio | 0.008446 |
| Married | -0.237793 |
| Single | 0.080843 |
odds = np.exp(lr_estimator1.coef_[0]) # converting coefficients to odds
pd.set_option('display.max_columns',None) # removing limit from number of columns to display
pd.DataFrame(odds, X_train_over.columns, columns=['odds']).T # adding the odds to a dataframe
| | Gender | Education | Income | Card | Products_Held | Months_Inactive | Contacts | Balance | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | Married | Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| odds | 1.729603 | 0.990416 | 1.286357 | 1.120676 | 0.544126 | 1.809452 | 1.787918 | 0.499015 | 0.905294 | 7.245003 | 0.039563 | 0.48102 | 1.008482 | 0.788366 | 1.084201 |
perc_change_odds = (np.exp(lr_estimator1.coef_[0])-1)*100 # finding the percentage change
pd.set_option('display.max_columns',None) # removing limit from number of columns to display
pd.DataFrame(perc_change_odds, X_train_over.columns, columns=['change_odds%']).T # adding the change_odds% to a dataframe
| | Gender | Education | Income | Card | Products_Held | Months_Inactive | Contacts | Balance | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | Married | Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| change_odds% | 72.960341 | -0.958377 | 28.635686 | 12.06757 | -45.587435 | 80.945192 | 78.791826 | -50.098462 | -9.470559 | 624.500277 | -96.043682 | -51.898022 | 0.8482 | -21.163428 | 8.420065 |
Trans_Totals: Holding all other features constant, a one-unit increase in total transaction amount multiplies the odds of a credit card account being attrited by 7.25, i.e. a 624.5% increase in the odds of attrition.
Months_Inactive: For every month an account is inactive, the odds of attrition are multiplied by 1.81.
Contacts: For every contact made by the bank, the odds of attrition are multiplied by 1.79.
Trans_Count: Holding all other features constant, a one-unit increase in Trans_Count multiplies the odds of attrition by 0.04, i.e. a 96.04% decrease in the odds of attrition.
Count_Changes: For every additional change in transaction count, the odds of attrition are multiplied by 0.48 (a 51.9% decrease).
Products_Held: For every additional product held by the customer, the odds of attrition are multiplied by 0.54 (a 45.6% decrease).
Single: Single customers have 8.4% higher odds of attrition than divorced customers.
Married: Married customers have 21.16% lower odds of attrition than divorced customers.
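These percentages all come from exponentiating the log-odds coefficients; for example, the Trans_Totals figures can be reproduced directly from its coefficient in the table above:

```python
import numpy as np

# Trans_Totals log-odds coefficient from the table above
beta = 1.980312

odds_ratio = np.exp(beta)            # multiplicative change in odds per unit
pct_change = (odds_ratio - 1) * 100  # same change expressed as a percentage
print(round(odds_ratio, 2), round(pct_change, 1))
```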
Logistic Regression-Regularized (Oversampled data) generalizes well across the train and test datasets and is the new standard to beat, with a test recall of 83.8% and 79 false negatives.
x = TB.drop(["Flag"], axis=1)
y = TB["Flag"]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3 , random_state=7,stratify=y)
y.value_counts(1)
0 0.83934 1 0.16066 Name: Flag, dtype: float64
y_train.value_counts(1)
0 0.839306 1 0.160694 Name: Flag, dtype: float64
Before building the models, let's create functions to calculate the different metrics - Accuracy, Recall, Precision and F1-score - and to plot the confusion matrix, so we don't have to repeat the same code for each model.
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=[1, 0]):
    '''
    model : classifier to predict values of X
    y_actual : ground truth
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
## Function to calculate different metric scores of the model - Accuracy, Recall, Precision and F1
def get_metrics_score(model, flag=True):
    '''
    model : classifier to predict values of X
    '''
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    train_f1 = metrics.f1_score(y_train, pred_train)
    test_f1 = metrics.f1_score(y_test, pred_test)
    score_list.extend((train_acc, test_acc, train_recall, test_recall,
                       train_precision, test_precision, train_f1, test_f1))
    # The following print statements are displayed only when flag is True (the default)
    if flag == True:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
        print("F1-Score on training set : ", train_f1)
        print("F1-Score on test set : ", test_f1)
    return score_list  # returning the list with train and test scores
#Fitting the model
d_tree = DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train,y_train)
#Using above defined function to get accuracy, recall and precision and F1 scores on train and test set
get_metrics_score(d_tree)
#Creating confusion matrix
make_confusion_matrix(d_tree,y_test)
Accuracy on training set : 1.0
Accuracy on test set : 0.9256334320500165
Recall on training set : 1.0
Recall on test set : 0.7704918032786885
Precision on training set : 1.0
Precision on test set : 0.7673469387755102
F1-Score on training set : 1.0
F1-Score on test set : 0.7689161554192228
#Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight={0:0.18,1:0.72},random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2,30),
'min_samples_leaf': [1, 2, 5, 7, 10],
'max_leaf_nodes' : [2, 3, 5, 10,15],
'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer,n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.18, 1: 0.72}, max_depth=5,
max_leaf_nodes=15, min_impurity_decrease=0.0001,
min_samples_leaf=10, random_state=1)
#Using above defined function to get accuracy, recall and precision and F1 scores on train and test set
get_metrics_score(dtree_estimator)
#Creating confusion matrix
make_confusion_matrix(dtree_estimator,y_test)
Accuracy on training set : 0.920852144469526
Accuracy on test set : 0.9052319842053307
Recall on training set : 0.9104477611940298
Recall on test set : 0.8831967213114754
Precision on training set : 0.6931818181818182
Precision on test set : 0.6510574018126888
F1-Score on training set : 0.7870967741935485
F1-Score on test set : 0.7495652173913043
#base_estimator for bagging classifier is a decision tree by default
bagging_estimator=BaggingClassifier(random_state=1)
bagging_estimator.fit(X_train,y_train)
BaggingClassifier(random_state=1)
#Using above defined function to get accuracy, recall and precision and F1 scores on train and test set
bagging_estimator_score=get_metrics_score(bagging_estimator)
#Creating confusion matrix
make_confusion_matrix(bagging_estimator,y_test)
Accuracy on training set : 0.9950620767494357
Accuracy on test set : 0.9483382691674893
Recall on training set : 0.974539069359087
Recall on test set : 0.7868852459016393
Precision on training set : 0.9946236559139785
Precision on test set : 0.8787185354691075
F1-Score on training set : 0.9844789356984479
F1-Score on test set : 0.8302702702702703
#Train the random forest classifier
rf_estimator=RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
rf_estimator_score=get_metrics_score(rf_estimator)
# make the confusion matrix
make_confusion_matrix(rf_estimator,y_test)
Accuracy on training set : 1.0
Accuracy on test set : 0.9536031589338598
Recall on training set : 1.0
Recall on test set : 0.7889344262295082
Precision on training set : 1.0
Precision on test set : 0.9101654846335697
F1-Score on training set : 1.0
F1-Score on test set : 0.845225027442371
Some of the important hyperparameters available for the bagging classifier are:
max_samples: the fraction of samples to draw from X to train each base estimator, default=1.0.
max_features: the fraction of features to draw from X to train each base estimator, default=1.0.
n_estimators: the number of base estimators in the ensemble, default=10.
# Choose the type of classifier.
bagging_estimator_tuned = BaggingClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_samples': [0.7,0.8,0.9,1],
'max_features': [0.7,0.8,0.9,1],
'n_estimators' : [10,20,30,40,50],
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
BaggingClassifier(max_features=0.9, max_samples=0.8, n_estimators=50,
random_state=1)
Let's check different metrics for bagging classifier with best hyperparameters and build a confusion matrix.
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_estimator_tuned_score=get_metrics_score(bagging_estimator_tuned)
#make the confusion matrix
make_confusion_matrix(bagging_estimator_tuned,y_test)
Accuracy on training set : 0.9987302483069977
Accuracy on test set : 0.9582099374794341
Recall on training set : 0.9920983318700615
Recall on test set : 0.8381147540983607
Precision on training set : 1.0
Precision on test set : 0.8949671772428884
F1-Score on training set : 0.9960334949316879
F1-Score on test set : 0.8656084656084656
We can also change the base_estimator of the bagging classifier, which is a decision tree by default.
bagging_lr = BaggingClassifier(base_estimator=LogisticRegression(random_state=1), random_state=1)
bagging_lr.fit(X_train,y_train)
BaggingClassifier(base_estimator=LogisticRegression(random_state=1),
random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
bagging_lr_score=get_metrics_score(bagging_lr)
Accuracy on training set : 0.889813769751693
Accuracy on test set : 0.8877920368542284
Recall on training set : 0.485513608428446
Recall on test set : 0.45491803278688525
Precision on training set : 0.7393048128342246
Precision on test set : 0.7474747474747475
F1-Score on training set : 0.5861155272919979
F1-Score on test set : 0.5656050955414013
make_confusion_matrix(bagging_lr,y_test)
We will try to improve the model by tuning the random forest classifier. Some of the important hyperparameters available for the random forest classifier are:
n_estimators: the number of trees in the forest, default=100.
min_samples_leaf: the minimum number of samples required at a leaf node, default=1.
max_features: the number of features to consider when looking for the best split.
max_samples: the fraction of samples to draw from X to train each tree; by default all samples are used.
oob_score: whether to use out-of-bag samples to estimate the generalization accuracy, default=False.
Note: A lot of hyperparameters of Decision Trees, such as max_depth and min_samples_split, are also available to tune Random Forest.
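As a side note on oob_score: with bootstrap sampling each tree leaves out roughly a third of the rows, and scoring every tree on its own left-out rows yields a built-in validation estimate. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=12, random_state=1)

# oob_score=True scores each tree on the bootstrap samples it never saw
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
rf.fit(X, y)
print(rf.oob_score_)  # out-of-bag accuracy, no separate holdout needed
```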
# Choose the type of classifier.
rf_estimator_tuned = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {"n_estimators": [150,200,250],
"min_samples_leaf": np.arange(5, 10),
"max_features": np.arange(0.2, 0.7, 0.1),
"max_samples": np.arange(0.3, 0.7, 0.1),
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(max_features=0.6000000000000001,
max_samples=0.6000000000000001, min_samples_leaf=5,
n_estimators=200, random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
rf_estimator_tuned_score=get_metrics_score(rf_estimator_tuned)
#make confusion matrix
make_confusion_matrix(rf_estimator_tuned,y_test)
Accuracy on training set : 0.973617381489842
Accuracy on test set : 0.9526159921026653
Recall on training set : 0.8946444249341527
Recall on test set : 0.8114754098360656
Precision on training set : 0.9383057090239411
Precision on test set : 0.8839285714285714
F1-Score on training set : 0.9159550561797752
F1-Score on test set : 0.8461538461538461
The model performance is not very good. This may be due to the class imbalance: there are far more existing customers than attrited ones.
We should make the model aware that the class of interest here is the attrited class (1).
We can do so by passing the class_weight parameter available for random forest. This parameter is not available for the bagging classifier.
class_weight specifies the weights associated with classes in the form {class_label: weight}. If not given, all classes are assumed to have weight one.
We choose class_weight={0: 0.188, 1: 0.812} to up-weight the minority attrited class in proportion to the imbalance in our data.
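If we preferred not to hard-code the weights, scikit-learn can derive them from the label counts; a minimal sketch with hypothetical labels matching the 81.2/18.8 ratio used above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with the 81.2% / 18.8% split assumed above
y = np.array([0] * 812 + [1] * 188)

# 'balanced' sets each weight to n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], np.round(weights, 3))))
```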
# Choose the type of classifier.
rf_estimator_weighted = RandomForestClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"class_weight": [{0: 0.188, 1: 0.812}], # setting the weights equal to the 81.2% No, 18.8% Yes ratio
"n_estimators": [100,150,200,250],
"min_samples_leaf": np.arange(5, 10),
"max_features": np.arange(0.2, 0.7, 0.1),
"max_samples": np.arange(0.3, 0.7, 0.1),
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_weighted, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator_weighted = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator_weighted.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.188, 1: 0.812},
max_features=0.4000000000000001,
max_samples=0.6000000000000001, min_samples_leaf=9,
n_estimators=150, random_state=1)
#Using above defined function to get accuracy, recall and precision on train and test set
rf_estimator_weighted_score=get_metrics_score(rf_estimator_weighted)
#make confusion matrix
make_confusion_matrix(rf_estimator_weighted,y_test)
Accuracy on training set : 0.9621896162528216
Accuracy on test set : 0.9437314906219151
Recall on training set : 0.9561018437225637
Recall on test set : 0.9016393442622951
Precision on training set : 0.8332058148431523
Precision on test set : 0.7815275310834814
F1-Score on training set : 0.8904333605887164
F1-Score on test set : 0.8372978116079924
# defining list of models
models = [bagging_estimator,bagging_estimator_tuned,bagging_lr,rf_estimator,rf_estimator_tuned,
rf_estimator_weighted]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
# looping through all the models to get the accuracy, recall and precision scores
for model in models:
    j = get_metrics_score(model, False)
    acc_train.append(np.round(j[0], 2))
    acc_test.append(np.round(j[1], 2))
    recall_train.append(np.round(j[2], 2))
    recall_test.append(np.round(j[3], 2))
    precision_train.append(np.round(j[4], 2))
    precision_test.append(np.round(j[5], 2))
    f1_train.append(np.round(j[6], 2))
    f1_test.append(np.round(j[7], 2))
comparison_frame = pd.DataFrame({'Model':['Bagging classifier with default parameters','Tuned Bagging Classifier',
                                          'Bagging classifier with base_estimator=LR', 'Random Forest with default parameters',
                                          'Tuned Random Forest Classifier','Random Forest with class_weights'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test,
'Train_F1-Score':f1_train, 'Test_F1-Score':f1_test})
#Sorting models in decreasing order of test recall
comparison_frame.sort_values(by='Test_Recall',ascending=False)
| | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1-Score | Test_F1-Score |
|---|---|---|---|---|---|---|---|---|---|
| 5 | Random Forest with class_weights | 0.96 | 0.94 | 0.96 | 0.90 | 0.83 | 0.78 | 0.89 | 0.84 |
| 1 | Tuned Bagging Classifier | 1.00 | 0.96 | 0.99 | 0.84 | 1.00 | 0.89 | 1.00 | 0.87 |
| 4 | Tuned Random Forest Classifier | 0.97 | 0.95 | 0.89 | 0.81 | 0.94 | 0.88 | 0.92 | 0.85 |
| 0 | Bagging classifier with default parameters | 1.00 | 0.95 | 0.97 | 0.79 | 0.99 | 0.88 | 0.98 | 0.83 |
| 3 | Random Forest with default parameters | 1.00 | 0.95 | 1.00 | 0.79 | 1.00 | 0.91 | 1.00 | 0.85 |
| 2 | Bagging classifier with base_estimator=LR | 0.89 | 0.89 | 0.49 | 0.45 | 0.74 | 0.75 | 0.59 | 0.57 |
#Fitting the model
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train,y_train)
#Calculating different metrics
get_metrics_score(ab_classifier)
#Creating confusion matrix
make_confusion_matrix(ab_classifier,y_test)
Accuracy on training set : 0.9628950338600452
Accuracy on test set : 0.9480092135570911
Recall on training set : 0.8612818261633012
Recall on test set : 0.7848360655737705
Precision on training set : 0.9033149171270718
Precision on test set : 0.8784403669724771
F1-Score on training set : 0.8817977528089888
F1-Score on test set : 0.8290043290043292
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
#Let's try different max_depth for base_estimator
"base_estimator":[DecisionTreeClassifier(max_depth=1),DecisionTreeClassifier(max_depth=2),
DecisionTreeClassifier(max_depth=3)],
"n_estimators": np.arange(10,110,10),
"learning_rate":np.arange(0.1,2,0.1)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2),
learning_rate=0.30000000000000004, n_estimators=100,
random_state=1)
#Calculating different metrics
get_metrics_score(abc_tuned)
#Creating confusion matrix
make_confusion_matrix(abc_tuned,y_test)
Accuracy on training set : 0.9802483069977427
Accuracy on test set : 0.9582099374794341
Recall on training set : 0.9218612818261633
Recall on test set : 0.8319672131147541
Precision on training set : 0.9536784741144414
Precision on test set : 0.9002217294900222
F1-Score on training set : 0.9375
F1-Score on test set : 0.8647497337593184
#Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)
#Calculating different metrics
get_metrics_score(gb_classifier)
#Creating confusion matrix
make_confusion_matrix(gb_classifier,y_test)
Accuracy on training set : 0.9753103837471784
Accuracy on test set : 0.9585389930898321
Recall on training set : 0.8928884986830553
Recall on test set : 0.8155737704918032
Precision on training set : 0.9504672897196261
Precision on test set : 0.9170506912442397
F1-Score on training set : 0.9207786328655501
F1-Score on test set : 0.8633405639913233
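The train/test recall gap above hints at mild overfitting as boosting proceeds; staged_predict, which yields predictions after each boosting stage, is a handy way to watch that happen. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

gb = GradientBoostingClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# staged_predict yields predictions after each added tree, so we can
# trace test recall across the whole boosting trajectory
test_recall = [recall_score(y_te, pred) for pred in gb.staged_predict(X_te)]
print(len(test_recall), max(test_recall))
```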
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
# Grid of parameters to choose from
parameters = {
"n_estimators": [100,150,200,250],
"subsample":[0.8,0.9,1],
"max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
max_features=0.7, n_estimators=250, random_state=1,
subsample=0.9)
#Calculating different metrics
get_metrics_score(gbc_tuned)
#Creating confusion matrix
make_confusion_matrix(gbc_tuned,y_test)
Accuracy on training set : 0.9863148984198645
Accuracy on test set : 0.9651201052977953
Recall on training set : 0.9455662862159789
Recall on test set : 0.8586065573770492
Precision on training set : 0.9685251798561151
Precision on test set : 0.918859649122807
F1-Score on training set : 0.9569080408707241
F1-Score on test set : 0.8877118644067796
#Fitting the model
xgb_classifier = XGBClassifier(random_state=1, eval_metric='logloss', enable_categorical='True')
xgb_classifier.fit(X_train,y_train)
#Calculating different metrics
get_metrics_score(xgb_classifier)
#Creating confusion matrix
make_confusion_matrix(xgb_classifier,y_test)
[17:48:35] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:573:
Parameters: { "enable_categorical" } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
Accuracy on training set : 1.0
Accuracy on test set : 0.9641329384666009
Recall on training set : 1.0
Recall on test set : 0.8709016393442623
Precision on training set : 1.0
Precision on test set : 0.9023354564755839
F1-Score on training set : 1.0
F1-Score on test set : 0.886339937434828
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1, eval_metric='logloss')
# Grid of parameters to choose from
parameters = {
"n_estimators": [200],
"scale_pos_weight":[3],
"subsample":[0.9],
"learning_rate":[0.01],
"gamma":[0],
"colsample_bytree":[0.9],
"colsample_bylevel":[1],
"colsample_bynode":[1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters,scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.9, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.01, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=200, n_jobs=4,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=3, subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None)
#Calculating different metrics
get_metrics_score(xgb_tuned)
#Creating confusion matrix
make_confusion_matrix(xgb_tuned,y_test)
Accuracy on training set : 0.9706546275395034
Accuracy on test set : 0.9519578808818691
Recall on training set : 0.9657594381035997
Recall on test set : 0.9036885245901639
Precision on training set : 0.8668242710795903
Precision on test set : 0.8166666666666667
F1-Score on training set : 0.9136212624584718
F1-Score on test set : 0.8579766536964981
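The `scale_pos_weight=3` searched above up-weights the minority (churn) class. A common starting heuristic — not taken from this notebook — is the ratio of negative to positive samples in the training labels:

```python
import numpy as np

# Illustrative labels with roughly a 16% positive (churn) rate
y = np.array([0] * 84 + [1] * 16)
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(round(scale_pos_weight, 2))  # 5.25
```

Values above the heuristic push recall up at the cost of precision, as the later tuned models show.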
# defining list of models
models = [ab_classifier, abc_tuned, gb_classifier, gbc_tuned, xgb_classifier,xgb_tuned]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, False)
    acc_train.append(np.round(j[0], 2))
    acc_test.append(np.round(j[1], 2))
    recall_train.append(np.round(j[2], 2))
    recall_test.append(np.round(j[3], 2))
    precision_train.append(np.round(j[4], 2))
    precision_test.append(np.round(j[5], 2))
    f1_train.append(np.round(j[6], 2))
    f1_test.append(np.round(j[7], 2))
comparison_frame = pd.DataFrame({'Model':['AdaBoost Classifier','Tuned AdaBoost Classifier',
'Gradient Boosting Classifier', 'Tuned Gradient Boosting Classifier',
'XGBoost Classifier', 'Tuned XGBoost Classifier'],
'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
'Train_Recall':recall_train,'Test_Recall':recall_test,
'Train_Precision':precision_train,'Test_Precision':precision_test,
'Train_F1-Score':f1_train, 'Test_F1-Score':f1_test})
#Sorting models in decreasing order of test recall
comparison_frame.sort_values(by='Test_Recall',ascending=False)
|   | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | Train_F1-Score | Test_F1-Score |
|---|---|---|---|---|---|---|---|---|---|
| 5 | Tuned XGBoost Classifier | 0.97 | 0.95 | 0.97 | 0.90 | 0.87 | 0.82 | 0.91 | 0.86 |
| 4 | XGBoost Classifier | 1.00 | 0.96 | 1.00 | 0.87 | 1.00 | 0.90 | 1.00 | 0.89 |
| 3 | Tuned Gradient Boosting Classifier | 0.99 | 0.97 | 0.95 | 0.86 | 0.97 | 0.92 | 0.96 | 0.89 |
| 1 | Tuned AdaBoost Classifier | 0.98 | 0.96 | 0.92 | 0.83 | 0.95 | 0.90 | 0.94 | 0.86 |
| 2 | Gradient Boosting Classifier | 0.98 | 0.96 | 0.89 | 0.82 | 0.95 | 0.92 | 0.92 | 0.86 |
| 0 | AdaBoost Classifier | 0.96 | 0.95 | 0.86 | 0.78 | 0.90 | 0.88 | 0.88 | 0.83 |
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X_train.columns)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
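A note on the plot logic above: `np.argsort` returns indices ordered from least to most important, and `barh` draws the first entry at the bottom, so the strongest features end up at the top of the chart. A toy check with illustrative values:

```python
import numpy as np

importances = np.array([0.10, 0.50, 0.40])
names = ["Contacts", "Trans_Count", "Ratio"]  # illustrative feature names
indices = np.argsort(importances)  # ascending order of importance

print([names[i] for i in indices])  # ['Contacts', 'Ratio', 'Trans_Count']
```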
X = TB.drop(["Flag"], axis=1)
y = TB["Flag"]
# Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(7088, 15) (3039, 15)
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Gender             0
Education          0
Income             0
Card               0
Products_Held      0
Months_Inactive    0
Contacts           0
Balance            0
Trans_Changes      0
Trans_Totals       0
Trans_Count        0
Count_Changes      0
Ratio              0
Married            0
Single             0
dtype: int64
------------------------------
Gender             0
Education          0
Income             0
Card               0
Products_Held      0
Months_Inactive    0
Contacts           0
Balance            0
Trans_Changes      0
Trans_Totals       0
Trans_Count        0
Count_Changes      0
Ratio              0
Married            0
Single             0
dtype: int64
models = [] # Empty list to store all the models
# Appending pipelines for each model into the list
models.append(
(
"Logistic Regression",
Pipeline(
steps=[
("scaler", StandardScaler()),
("log_reg", LogisticRegression(random_state=1)),
]
),
)
)
models.append(
(
"Random Forest",
Pipeline(
steps=[
("scaler", StandardScaler()),
("random_forest", RandomForestClassifier(random_state=1)),
]
),
)
)
models.append(
(
"Gradient Boosting",
Pipeline(
steps=[
("scaler", StandardScaler()),
("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"AdaBoost",
Pipeline(
steps=[
("scaler", StandardScaler()),
("adaboost", AdaBoostClassifier(random_state=1)),
]
),
)
)
models.append(
(
"XGBoost",
Pipeline(
steps=[
("scaler", StandardScaler()),
("xgboost", XGBClassifier(random_state=1, eval_metric='logloss')),
]
),
)
)
models.append(
(
"Decision Tree",
Pipeline(
steps=[
("scaler", StandardScaler()),
("decision_tree", DecisionTreeClassifier(random_state=1)),
]
),
)
)
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
Logistic Regression: 57.946904706700664
Random Forest: 79.54130922018703
Gradient Boosting: 82.52685678955099
AdaBoost: 82.9673854239122
XGBoost: 85.33735219105031
Decision Tree: 77.87348326764047
# create a table for comparison purposes
Table1 = pd.Series([57.95,79.54,82.53,82.97,85.34,77.87],
                   ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'AdaBoost', 'XGBoost', 'Decision Tree'])
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
We will use pipelines with StandardScaler and tune the models using GridSearchCV and RandomizedSearchCV, then compare the performance of the top three methods.
First, let's create two functions — one to calculate the different metric scores and one to create a confusion matrix — so that we don't repeat the same code for each model.
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, flag=True):
    """
    model : classifier to predict values of X
    flag  : if True, also print the scores (default True)
    """
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    score_list.extend(
        (
            train_acc,
            test_acc,
            train_recall,
            test_recall,
            train_precision,
            test_precision,
        )
    )
    # If the flag is set to True, the scores are also printed, reusing the values computed above
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return score_list  # returning the list with train and test scores
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual):
    """
    model    : classifier to predict values of X
    y_actual : ground truth
    """
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    data_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(data_cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
XGBoost was our top-performing model prior to hyperparameter tuning, with a cross-validation recall of 85.34 on the train data.
%%time
# Creating pipeline
pipe = make_pipeline(
StandardScaler(), XGBClassifier(random_state=1, eval_metric="logloss")
)
# Parameter grid to pass in GridSearchCV
param_grid = {
    "xgbclassifier__n_estimators": np.arange(50, 300, 50),
"xgbclassifier__scale_pos_weight": [0, 1, 2, 5, 10],
"xgbclassifier__learning_rate": [0.01, 0.1, 0.2, 0.05],
"xgbclassifier__gamma": [0, 1, 3, 5],
"xgbclassifier__subsample": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
grid_cv.best_params_, grid_cv.best_score_
)
)
Best parameters are {'xgbclassifier__gamma': 3, 'xgbclassifier__learning_rate': 0.1, 'xgbclassifier__n_estimators': 50, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__subsample': 0.7} with CV score=0.9499574928510703:
Wall time: 7min 14s
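The seven-minute wall time follows from how GridSearchCV works: it fits one model per parameter combination per fold, so cost grows multiplicatively with the grid. `ParameterGrid` lets us count the candidates before launching a search (illustrative grid, smaller than the one above):

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: 3 * 2 * 4 = 24 candidate combinations
grid = {
    "n_estimators": [50, 100, 150],
    "subsample": [0.8, 1.0],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}
n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5  # with cv=5, five fits per candidate
print(n_candidates, n_fits)  # 24 120
```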
# Creating a new pipeline with the selected parameters (note: learning_rate and subsample differ from the GridSearchCV best values above)
xgb_tuned1 = make_pipeline(
StandardScaler(),
XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
subsample=0.9,
learning_rate=0.01,
gamma=3,
),
)
# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
[01:33:10] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Pipeline(steps=[('standardscaler', StandardScaler()),
('xgbclassifier',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, gamma=3, gpu_id=-1,
importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50,
n_jobs=4, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=10,
subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
get_metrics_score(xgb_tuned1)
# Creating confusion matrix
make_confusion_matrix(xgb_tuned1, y_test)
Accuracy on training set : 0.9160553047404063
Accuracy on test set : 0.8914116485686081
Recall on training set : 0.9841966637401229
Recall on test set : 0.9385245901639344
Precision on training set : 0.660188457008245
Precision on test set : 0.604221635883905
%%time
#Creating pipeline
pipe=make_pipeline(StandardScaler(),XGBClassifier(random_state=1,eval_metric='logloss'))
#Parameter grid to pass in RandomizedSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(50,300,50),'xgbclassifier__scale_pos_weight':[0,1,2,5,10],
'xgbclassifier__learning_rate':[0.01,0.1,0.2,0.05], 'xgbclassifier__gamma':[0,1,3,5],
'xgbclassifier__subsample':[0.7,0.8,0.9,1]}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'xgbclassifier__subsample': 0.9, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__n_estimators': 200, 'xgbclassifier__learning_rate': 0.01, 'xgbclassifier__gamma': 1} with CV score=0.9446904706700673:
Wall time: 1min 58s
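RandomizedSearchCV finished in roughly two minutes against GridSearchCV's seven because it evaluates only `n_iter` sampled candidates, however large the grid. `ParameterSampler` shows the sampling it performs internally (illustrative grid):

```python
from sklearn.model_selection import ParameterSampler

grid = {
    "n_estimators": [50, 100, 150, 200, 250],
    "subsample": [0.7, 0.8, 0.9, 1.0],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}
# Draw 10 of the 5 * 4 * 4 = 80 possible combinations
sampled = list(ParameterSampler(grid, n_iter=10, random_state=1))
print(len(sampled))  # 10
```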
# Creating a new pipeline with the selected parameters (note: n_estimators=20 differs from the RandomizedSearchCV best value of 200)
xgb_tuned2 = Pipeline(
steps=[
("scaler", StandardScaler()),
(
"XGB",
XGBClassifier(
random_state=1,
n_estimators=20,
scale_pos_weight=10,
learning_rate=0.01,
gamma=1,
subsample=0.9,
),
),
]
)
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
[17:56:03] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Pipeline(steps=[('scaler', StandardScaler()),
('XGB',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, gamma=1, gpu_id=-1,
importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=20,
n_jobs=4, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=10,
subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
get_metrics_score(xgb_tuned2)
# Creating confusion matrix
make_confusion_matrix(xgb_tuned2, y_test)
Accuracy on training set : 0.9095654627539503
Accuracy on test set : 0.8831852583086541
Recall on training set : 0.9841966637401229
Recall on test set : 0.9282786885245902
Precision on training set : 0.6427752293577982
Precision on test set : 0.5860284605433377
AdaBoost was our second-best performing model prior to hyperparameter tuning, with a cross-validation recall of 82.97 on the train data.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"adaboostclassifier__n_estimators": np.arange(10, 110, 10),
"adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"adaboostclassifier__base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1), 'adaboostclassifier__learning_rate': 1, 'adaboostclassifier__n_estimators': 40}
Score: 0.8542777648968235
Wall time: 2min 8s
# Creating a new pipeline with the selected parameters (note: n_estimators=100 differs from the GridSearchCV best value of 40)
abc_tuned1 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
n_estimators=100,
learning_rate=1,
random_state=1,
),
)
# Fit the model on training data
abc_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('adaboostclassifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=1),
learning_rate=1, n_estimators=100,
random_state=1))])
# Calculating different metrics
get_metrics_score(abc_tuned1)
# Creating confusion matrix
make_confusion_matrix(abc_tuned1, y_test)
Accuracy on training set : 0.9877257336343115
Accuracy on test set : 0.9647910496873972
Recall on training set : 0.9631255487269534
Recall on test set : 0.8913934426229508
Precision on training set : 0.9605954465849387
Precision on test set : 0.8895705521472392
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"adaboostclassifier__n_estimators": np.arange(10, 110, 10),
"adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"adaboostclassifier__base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
abc_tuned2 = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
abc_tuned2.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(abc_tuned2.best_params_,abc_tuned2.best_score_))
Best parameters are {'adaboostclassifier__n_estimators': 90, 'adaboostclassifier__learning_rate': 0.2, 'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.8525465646495093:
Wall time: 2min 6s
# Creating a new pipeline with the selected parameters (note: this reuses the grid-search configuration rather than the RandomizedSearchCV best values)
abc_tuned2 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
n_estimators=100,
learning_rate=1,
random_state=1,
),
)
# Fit the model on training data
abc_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('adaboostclassifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=1),
learning_rate=1, n_estimators=100,
random_state=1))])
# Calculating different metrics
get_metrics_score(abc_tuned2)
# Creating confusion matrix
make_confusion_matrix(abc_tuned2, y_test)
Accuracy on training set : 0.9877257336343115
Accuracy on test set : 0.9647910496873972
Recall on training set : 0.9631255487269534
Recall on test set : 0.8913934426229508
Precision on training set : 0.9605954465849387
Precision on test set : 0.8895705521472392
Gradient Boosting was our third-best performing model prior to hyperparameter tuning, with a cross-validation recall of 82.53 on the train data.
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"gradientboostingclassifier__n_estimators": [100,150,200,250],
"gradientboostingclassifier__subsample":[0.8,0.9,1],
"gradientboostingclassifier__max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'gradientboostingclassifier__max_features': 0.9, 'gradientboostingclassifier__n_estimators': 250, 'gradientboostingclassifier__subsample': 1}
Score: 0.8551665507380788
Wall time: 1min 26s
# Creating new pipeline with best parameters
gbc_tuned1 = make_pipeline(
StandardScaler(),
GradientBoostingClassifier(
random_state=1,
n_estimators=250,
subsample=1.0,
max_features=0.9,
),
)
# Fit the model on training data
gbc_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('gradientboostingclassifier',
GradientBoostingClassifier(max_features=0.9, n_estimators=250,
random_state=1))])
# Calculating different metrics
get_metrics_score(gbc_tuned1)
# Creating confusion matrix
make_confusion_matrix(gbc_tuned1, y_test)
Accuracy on training set : 0.9849040632054176
Accuracy on test set : 0.9700559394537677
Recall on training set : 0.935908691834943
Recall on test set : 0.8954918032786885
Precision on training set : 0.9690909090909091
Precision on test set : 0.9161425576519916
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"gradientboostingclassifier__n_estimators": [100,150,200,250],
"gradientboostingclassifier__subsample":[0.8,0.9,1],
"gradientboostingclassifier__max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
gbc_tuned2 = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
gbc_tuned2.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(gbc_tuned2.best_params_,gbc_tuned2.best_score_))
Best parameters are {'gradientboostingclassifier__subsample': 1, 'gradientboostingclassifier__n_estimators': 250, 'gradientboostingclassifier__max_features': 0.9} with CV score=0.8551665507380788:
Wall time: 4min 19s
# Creating new pipeline with best parameters
gbc_tuned2 = make_pipeline(
StandardScaler(),
GradientBoostingClassifier(
random_state=1,
n_estimators=250,
subsample=1,
max_features=0.9,
),
)
# Fit the model on training data
gbc_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('gradientboostingclassifier',
GradientBoostingClassifier(max_features=0.9, n_estimators=250,
random_state=1, subsample=1))])
# Calculating different metrics
get_metrics_score(gbc_tuned2)
# Creating confusion matrix
make_confusion_matrix(gbc_tuned2, y_test)
Accuracy on training set : 0.9849040632054176
Accuracy on test set : 0.9700559394537677
Recall on training set : 0.935908691834943
Recall on test set : 0.8954918032786885
Precision on training set : 0.9690909090909091
Precision on test set : 0.9161425576519916
# defining list of models
models = [xgb_tuned1, xgb_tuned2, abc_tuned1, abc_tuned2, gbc_tuned1, gbc_tuned2 ]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
    j = get_metrics_score(model, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
comparison_frame = pd.DataFrame(
{
"Model": [
"XGBoost tuned with GridSearchCV",
"XGBoost tuned with RandomizedSearchCV",
"AdaBoost tuned with GridSearchCV",
"AdaBoost tuned with RandomizedSearchCV",
"Gradient Boosting tuned with GridSearchCV",
"Gradient Boosting tuned with RandomizedSearchCV"
],
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
}
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(by="Test_Recall", ascending=False)
|   | Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision |
|---|---|---|---|---|---|---|---|
| 0 | XGBoost tuned with GridSearchCV | 0.916055 | 0.891412 | 0.984197 | 0.938525 | 0.660188 | 0.604222 |
| 1 | XGBoost tuned with RandomizedSearchCV | 0.909565 | 0.883185 | 0.984197 | 0.928279 | 0.642775 | 0.586028 |
| 4 | Gradient Boosting tuned with GridSearchCV | 0.984904 | 0.970056 | 0.935909 | 0.895492 | 0.969091 | 0.916143 |
| 5 | Gradient Boosting tuned with RandomizedSearchCV | 0.984904 | 0.970056 | 0.935909 | 0.895492 | 0.969091 | 0.916143 |
| 2 | AdaBoost tuned with GridSearchCV | 0.987726 | 0.964791 | 0.963126 | 0.891393 | 0.960595 | 0.889571 |
| 3 | AdaBoost tuned with RandomizedSearchCV | 0.987726 | 0.964791 | 0.963126 | 0.891393 | 0.960595 | 0.889571 |
Feature Importance
We can check the feature importances of the top model: XGBoost with GridSearchCV
feature_names = X_train.columns
importances = xgb_tuned1[1].feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Here, we will attempt to build upon the best model by using alternative approaches to cleaning the data and other preprocessing techniques, to see if we can further improve the model. In Approach 4, we will use KNNImputer to determine the best replacement values for the 'unknown' values in the Education, Marital_Status, and Income columns.
KNNImputer: each sample's missing values are imputed by looking at the n_neighbors nearest neighbors found in the training set. The default is n_neighbors=5.
TB2.head() # We can now recall the TB2 dataset, where the unknowns in Education and Income are already -1 and Marital_Status is still unchanged
|   | Unnamed: 0 | Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 45 | 0 | 3 | 1 | Married | 2 | 0 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 1 | 0 | 49 | 1 | 5 | 3 | Single | 0 | 0 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 2 | 0 | 51 | 0 | 3 | 3 | Married | 3 | 0 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 3 | 0 | 40 | 1 | 4 | 1 | Unknown | 0 | 0 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 4 | 0 | 40 | 0 | 3 | 0 | Married | 2 | 0 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
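Before applying it to TB2, here is KNNImputer on a toy array (illustrative values, not our data): the NaN is replaced by the mean of the `n_neighbors` rows closest on the observed features, so imputed categorical codes come out fractional and need rounding afterwards.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Row 1 is missing its first feature; distances are computed on the second.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [4.0, 6.0],
              [5.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

# The two nearest rows (0 and 2) contribute (1.0 + 4.0) / 2 = 2.5,
# which np.round then maps back onto an integer category code.
print(X_imputed[1, 0], np.round(X_imputed[1, 0]))  # 2.5 2.0
```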
TB2.drop(['Unnamed: 0'], axis=1, inplace = True)
# Encoding variables - Marital_Status
TB2["Marital_Status"] = TB2["Marital_Status"].map({"Married": 1, "Single": 2, "Divorced": 3, "Unknown": -1})
# 'Unknown' values were previously replaced by -1; we now encode them as NaN
TB2['Education'] = TB2['Education'].replace(-1, np.nan)
TB2['Income'] = TB2['Income'].replace(-1, np.nan)
TB2['Marital_Status'] = TB2['Marital_Status'].replace(-1, np.nan)
TB2.isnull().sum() # check that the replacement worked (expected: Marital_Status = 749, Education = 1519, Income = 1112)
Flag                  0
Age                   0
Gender                0
Dependents            0
Education          1519
Marital_Status      749
Income             1112
Card                  0
Months                0
Products_Held         0
Months_Inactive       0
Contacts              0
Credit_Limit          0
Balance               0
Ave_Credit_Line       0
Trans_Changes         0
Trans_Totals          0
Trans_Count           0
Count_Changes         0
Ratio                 0
dtype: int64
X = TB2.drop(["Flag"], axis=1)
y = TB2["Flag"]
# Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(7088, 19) (3039, 19)
# Creating a list of columns with missing values
reqd_col_for_impute = ["Marital_Status", "Education", "Income"]
imputer = KNNImputer(n_neighbors=5)
# Fit and transform the train data
X_train[reqd_col_for_impute] = imputer.fit_transform(X_train[reqd_col_for_impute])
# Transform the test data
X_test[reqd_col_for_impute] = imputer.transform(X_test[reqd_col_for_impute])
# As KNNImputer replaces each missing value with the mean of its K nearest neighbours, we round off those values
X_train[reqd_col_for_impute] = np.round(X_train[reqd_col_for_impute])
X_test[reqd_col_for_impute] = np.round(X_test[reqd_col_for_impute])
Marital_Status is an unordered categorical variable; however, now that its categories have been encoded as numbers, the model will treat Divorced = 3 as greater than Single = 2, which in turn is greater than Married = 1. We will therefore convert the codes back to names and use one-hot encoding. Because the data has already been split, any transformations must now be applied to X_train and X_test separately to prevent data leakage.
# Revert numeric codes to their original names
X_train['Marital_Status'] = X_train['Marital_Status'].map({1: 'Married', 2: 'Single', 3: 'Divorced'})
X_train["Marital_Status"].value_counts()
Married     3503
Single      3083
Divorced     502
Name: Marital_Status, dtype: int64
# Revert numeric codes to their original names
X_test['Marital_Status'] = X_test['Marital_Status'].map({1: 'Married', 2: 'Single', 3: 'Divorced'})
X_test["Marital_Status"].value_counts()
Married     1499
Single      1294
Divorced     246
Name: Marital_Status, dtype: int64
# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
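One caution with calling `pd.get_dummies` on train and test separately: if a category is absent from one split, the dummy columns diverge. Fixing the category set beforehand keeps the columns aligned — a defensive sketch with made-up rows, not a step taken in this notebook:

```python
import pandas as pd

train = pd.DataFrame({"Marital_Status": ["Married", "Single", "Divorced"]})
test = pd.DataFrame({"Marital_Status": ["Married", "Married"]})  # two levels missing

# Declaring the full category set before encoding guarantees identical columns.
levels = ["Divorced", "Married", "Single"]
for df in (train, test):
    df["Marital_Status"] = pd.Categorical(df["Marital_Status"], categories=levels)

train_d = pd.get_dummies(train, drop_first=True)
test_d = pd.get_dummies(test, drop_first=True)
print(list(train_d.columns) == list(test_d.columns))  # True
```

Here all three marital-status levels appear in both splits, so the simple calls above happen to line up.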
X_train.head()
|   | Age | Gender | Dependents | Education | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | Marital_Status_Married | Marital_Status_Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4124 | 50 | 1 | 1 | 3.0 | 1.0 | 0 | 43 | 6 | 1 | 2 | 7985.0 | 0 | 7985.0 | 1.032 | 3873 | 72 | 0.674 | 0.000 | 1 | 0 |
| 4686 | 50 | 0 | 0 | 1.0 | 2.0 | 0 | 36 | 3 | 3 | 2 | 5444.0 | 2499 | 2945.0 | 0.468 | 4509 | 80 | 0.667 | 0.459 | 0 | 0 |
| 1276 | 26 | 1 | 0 | 3.0 | 1.0 | 0 | 13 | 6 | 3 | 4 | 1643.0 | 1101 | 542.0 | 0.713 | 2152 | 50 | 0.471 | 0.670 | 0 | 1 |
| 6119 | 65 | 1 | 0 | 2.0 | 0.0 | 0 | 55 | 3 | 3 | 0 | 2022.0 | 0 | 2022.0 | 0.579 | 4623 | 65 | 0.548 | 0.000 | 0 | 1 |
| 2253 | 46 | 0 | 3 | 3.0 | 3.0 | 0 | 35 | 6 | 3 | 4 | 4930.0 | 0 | 4930.0 | 1.019 | 3343 | 77 | 0.638 | 0.000 | 0 | 1 |
X_test.head()
|   | Age | Gender | Dependents | Education | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | Marital_Status_Married | Marital_Status_Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7403 | 38 | 0 | 0 | 2.0 | 2.0 | 0 | 26 | 5 | 2 | 1 | 3809.0 | 1521 | 2288.0 | 0.692 | 4666 | 69 | 0.865 | 0.399 | 1 | 0 |
| 2005 | 39 | 0 | 2 | 0.0 | 4.0 | 0 | 26 | 2 | 3 | 4 | 8906.0 | 0 | 8906.0 | 0.315 | 809 | 15 | 0.250 | 0.000 | 1 | 0 |
| 8270 | 45 | 0 | 4 | 5.0 | 3.0 | 0 | 39 | 2 | 3 | 2 | 1438.3 | 1162 | 276.3 | 0.539 | 4598 | 86 | 0.623 | 0.808 | 1 | 0 |
| 646 | 41 | 0 | 3 | 3.0 | 3.0 | 0 | 26 | 4 | 3 | 2 | 11806.0 | 1811 | 9995.0 | 0.754 | 1465 | 31 | 0.476 | 0.153 | 0 | 1 |
| 1690 | 65 | 1 | 1 | 2.0 | 1.0 | 0 | 48 | 4 | 2 | 4 | 4599.0 | 637 | 3962.0 | 0.622 | 2608 | 78 | 0.592 | 0.139 | 0 | 1 |
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Age                       0
Gender                    0
Dependents                0
Education                 0
Income                    0
Card                      0
Months                    0
Products_Held             0
Months_Inactive           0
Contacts                  0
Credit_Limit              0
Balance                   0
Ave_Credit_Line           0
Trans_Changes             0
Trans_Totals              0
Trans_Count               0
Count_Changes             0
Ratio                     0
Marital_Status_Married    0
Marital_Status_Single     0
dtype: int64
------------------------------
Age                       0
Gender                    0
Dependents                0
Education                 0
Income                    0
Card                      0
Months                    0
Products_Held             0
Months_Inactive           0
Contacts                  0
Credit_Limit              0
Balance                   0
Ave_Credit_Line           0
Trans_Changes             0
Trans_Totals              0
Trans_Count               0
Count_Changes             0
Ratio                     0
Marital_Status_Married    0
Marital_Status_Single     0
dtype: int64
models = [] # Empty list to store all the models
# Appending pipelines for each model into the list
models.append(
(
"Logistic Regression",
Pipeline(
steps=[
("scaler", StandardScaler()),
("log_reg", LogisticRegression(random_state=1)),
]
),
)
)
models.append(
(
"Random Forest",
Pipeline(
steps=[
("scaler", StandardScaler()),
("random_forest", RandomForestClassifier(random_state=1)),
]
),
)
)
models.append(
(
"Gradient Boosting",
Pipeline(
steps=[
("scaler", StandardScaler()),
("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"AdaBoost",
Pipeline(
steps=[
("scaler", StandardScaler()),
("adaboost", AdaBoostClassifier(random_state=1)),
]
),
)
)
models.append(
(
"XGBoost",
Pipeline(
steps=[
("scaler", StandardScaler()),
("xgboost", XGBClassifier(random_state=1, eval_metric='logloss')),
]
),
)
)
models.append(
(
"Decision Tree",
Pipeline(
steps=[
("scaler", StandardScaler()),
("decision_tree", DecisionTreeClassifier(random_state=1)),
]
),
)
)
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
Logistic Regression: 58.64981837854548
Random Forest: 80.33232861890409
Gradient Boosting: 84.01924414560632
AdaBoost: 83.670299095757
XGBoost: 86.9178452739779
Decision Tree: 79.62941494705926
# create a table for comparison purposes
Table2 = pd.Series([58.65,80.33,84.02,83.67,86.92,79.63],
                   ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'AdaBoost', 'XGBoost', 'Decision Tree'])
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
Did this approach improve the cross-validation scores? Let's compare...
# create a table for comparison purposes
Table1 = pd.Series([57.95,79.54,82.53,82.97,85.34,77.87],
                   ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'AdaBoost', 'XGBoost', 'Decision Tree'])
CT = pd.concat([Table1,Table2],axis=1,sort=False)
CT.columns=['Manual Impute','Knn Imputer']
print('\033[1m' + 'Cross Validation Scores (model averages)')
CT
Cross Validation Scores (model averages)
| Manual Impute | Knn Imputer | |
|---|---|---|
| Logistic Regression | 57.95 | 58.65 |
| Random Forest | 79.54 | 80.33 |
| Gradient Boosting | 82.53 | 84.02 |
| AdaBoost | 82.97 | 83.67 |
| XGBoost | 85.34 | 86.92 |
| Decision Tree | 77.87 | 79.63 |
The table shows that, across the board, the KNN imputer gives higher cross-validation recall scores. This could mean that the KNN imputer is making some of the unknown values more correlated with the other features, but it could also mean that it is simply more accurate than our manual imputation; after all, we made assumptions when imputing the unknown values manually. We will test the three best models (XGBoost, Gradient Boosting, and AdaBoost) to check for overfitting.
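For reference, the KNN imputation discussed above can be sketched on a toy array; the values and `n_neighbors=2` here are illustrative assumptions, not the settings used on the bank data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with missing entries (illustrative values only)
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each NaN is filled with the average of that feature across the
# k nearest rows, measured on the features that are observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)
print(X_filled)
```

Unlike a fixed manual rule, the filled value adapts to each row's neighbours, which may explain the small recall gains in the table above.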
We will use pipelines with StandardScaler to tune each model using GridSearchCV and RandomizedSearchCV, and compare the performance of the top three models.
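As a minimal sketch of the two search strategies on a toy dataset (the estimator, grid, and sample sizes below are placeholders, not the ones tuned later):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, random_state=1)
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
grid = {"decisiontreeclassifier__max_depth": [2, 3, 4, 5]}

# GridSearchCV evaluates every combination in the grid...
gs = GridSearchCV(pipe, param_grid=grid, scoring="recall", cv=3).fit(X_demo, y_demo)

# ...while RandomizedSearchCV samples only n_iter combinations,
# trading exhaustiveness for speed on large grids
rs = RandomizedSearchCV(
    pipe, param_distributions=grid, n_iter=2, scoring="recall", cv=3, random_state=1
).fit(X_demo, y_demo)
print(gs.best_params_, rs.best_params_)
```

On a small grid the two tend to agree; the speed difference shows up on grids like the XGBoost one below, where the number of combinations grows multiplicatively.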
First, let's create two helper functions, one to calculate the different metric scores and one to plot the confusion matrix, so that we don't have to repeat the same code for each model.
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, flag=True):
    """
    model : classifier to predict values of X
    """
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    score_list.extend(
        (
            train_acc,
            test_acc,
            train_recall,
            test_recall,
            train_precision,
            test_precision,
        )
    )
    # The following print statements are displayed only when flag is True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return score_list  # returning the list with train and test scores
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=[0, 1]):
    """
    model : classifier to predict values of X
    y_actual : ground truth
    """
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    data_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot_labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot_labels = np.asarray(annot_labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(data_cm, annot=annot_labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
XGBoost was our top performing model prior to hyperparameter tuning. The cross-validation score on the train data after using the KNN imputer was: 86.92
%%time
# Creating pipeline
pipe = make_pipeline(
StandardScaler(), XGBClassifier(random_state=1, eval_metric="logloss")
)
# Parameter grid to pass in GridSearchCV
param_grid = {
    "xgbclassifier__n_estimators": np.arange(50, 300, 50),
"xgbclassifier__scale_pos_weight": [0, 1, 2, 5, 10],
"xgbclassifier__learning_rate": [0.01, 0.1, 0.2, 0.05],
"xgbclassifier__gamma": [0, 1, 3, 5],
"xgbclassifier__subsample": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
grid_cv.best_params_, grid_cv.best_score_
)
)
Best parameters are {'xgbclassifier__gamma': 3, 'xgbclassifier__learning_rate': 0.1, 'xgbclassifier__n_estimators': 50, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__subsample': 0.7} with CV score=0.9482108354586908:
Wall time: 6min 31s
# Creating new pipeline with best parameters
xgb_tuned1 = make_pipeline(
StandardScaler(),
XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
subsample=0.7,
learning_rate=0.01,
gamma=3,
),
)
# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
[00:05:40] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Pipeline(steps=[('standardscaler', StandardScaler()),
('xgbclassifier',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, gamma=3, gpu_id=-1,
importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50,
n_jobs=4, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=10,
subsample=0.7, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
get_metrics_score(xgb_tuned1)
# Creating confusion matrix
make_confusion_matrix(xgb_tuned1, y_test)
Accuracy on training set :  0.9248024830699775
Accuracy on test set :  0.9009542612701547
Recall on training set :  0.9833187006145742
Recall on test set :  0.9323770491803278
Precision on training set :  0.6854345165238678
Precision on test set :  0.6293222683264177
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), XGBClassifier(random_state=1, eval_metric="logloss"))
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "xgbclassifier__n_estimators": np.arange(50, 300, 50),
    "xgbclassifier__scale_pos_weight": [0, 1, 2, 5, 10],
    "xgbclassifier__learning_rate": [0.01, 0.1, 0.2, 0.05],
    "xgbclassifier__gamma": [0, 1, 3, 5],
    "xgbclassifier__subsample": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'xgbclassifier__subsample': 1, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__n_estimators': 50, 'xgbclassifier__learning_rate': 0.05, 'xgbclassifier__gamma': 5} with CV score=0.9464409923487132:
Wall time: 2min 25s
# Creating new pipeline with best parameters
xgb_tuned2 = Pipeline(
steps=[
("scaler", StandardScaler()),
(
"XGB",
XGBClassifier(
random_state=1,
n_estimators=20,
scale_pos_weight=10,
learning_rate=0.01,
gamma=1,
subsample=0.9,
),
),
]
)
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
[00:08:07] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Pipeline(steps=[('scaler', StandardScaler()),
('XGB',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, gamma=1, gpu_id=-1,
importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=20,
n_jobs=4, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=10,
subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
get_metrics_score(xgb_tuned2)
# Creating confusion matrix
make_confusion_matrix(xgb_tuned2, y_test)
Accuracy on training set :  0.9137979683972912
Accuracy on test set :  0.8894373149062191
Recall on training set :  0.9850746268656716
Recall on test set :  0.9221311475409836
Precision on training set :  0.6538461538461539
Precision on test set :  0.6016042780748663
Gradient Boosting was our second best performing model on the KNN-imputed data prior to hyperparameter tuning. The cross-validation score on the train data was: 84.02
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"gradientboostingclassifier__n_estimators": [100,150,200,250],
"gradientboostingclassifier__subsample":[0.8,0.9,1],
"gradientboostingclassifier__max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'gradientboostingclassifier__max_features': 0.7, 'gradientboostingclassifier__n_estimators': 250, 'gradientboostingclassifier__subsample': 1}
Score: 0.8771118324445475
Wall time: 2min 3s
# Creating new pipeline with best parameters
gbc_tuned1 = make_pipeline(
StandardScaler(),
GradientBoostingClassifier(
random_state=1,
n_estimators=250,
subsample=1.0,
max_features=0.9,
),
)
# Fit the model on training data
gbc_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('gradientboostingclassifier',
GradientBoostingClassifier(max_features=0.9, n_estimators=250,
random_state=1))])
# Calculating different metrics
get_metrics_score(gbc_tuned1)
# Creating confusion matrix
make_confusion_matrix(gbc_tuned1, y_test)
Accuracy on training set :  0.9894187358916479
Accuracy on test set :  0.9746627179993419
Recall on training set :  0.9525899912203687
Recall on test set :  0.9036885245901639
Precision on training set :  0.9810126582278481
Precision on test set :  0.9363057324840764
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"gradientboostingclassifier__n_estimators": [100,150,200,250],
"gradientboostingclassifier__subsample":[0.8,0.9,1],
"gradientboostingclassifier__max_features":[0.7,0.8,0.9,1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
gbc_tuned2 = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting parameters in RandomizedSearchCV
gbc_tuned2.fit(X_train, y_train)
print(
    "Best parameters are {} with CV score={}:".format(
        gbc_tuned2.best_params_, gbc_tuned2.best_score_
    )
)
Best parameters are {'gradientboostingclassifier__subsample': 1, 'gradientboostingclassifier__n_estimators': 250, 'gradientboostingclassifier__max_features': 0.7} with CV score=0.8771118324445475:
Wall time: 6min 5s
# Creating new pipeline with best parameters
gbc_tuned2 = make_pipeline(
StandardScaler(),
GradientBoostingClassifier(
random_state=1,
n_estimators=250,
subsample=1,
max_features=0.9,
),
)
# Fit the model on training data
gbc_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('gradientboostingclassifier',
GradientBoostingClassifier(max_features=0.9, n_estimators=250,
random_state=1, subsample=1))])
# Calculating different metrics
get_metrics_score(gbc_tuned2)
# Creating confusion matrix
make_confusion_matrix(gbc_tuned2, y_test)
Accuracy on training set :  0.9894187358916479
Accuracy on test set :  0.9746627179993419
Recall on training set :  0.9525899912203687
Recall on test set :  0.9036885245901639
Precision on training set :  0.9810126582278481
Precision on test set :  0.9363057324840764
AdaBoost was our third best performing model prior to hyperparameter tuning. The cross-validation score on the train data was: 83.67
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
# Parameter grid to pass in GridSearchCV
param_grid = {
"adaboostclassifier__n_estimators": np.arange(10, 110, 10),
"adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"adaboostclassifier__base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)
# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)
print(
"Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1), 'adaboostclassifier__learning_rate': 1, 'adaboostclassifier__n_estimators': 100}
Score: 0.8780006182858028
Wall time: 2min 52s
# Creating new pipeline with best parameters
abc_tuned1 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
n_estimators=100,
learning_rate=1,
random_state=1,
),
)
# Fit the model on training data
abc_tuned1.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('adaboostclassifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=1),
learning_rate=1, n_estimators=100,
random_state=1))])
# Calculating different metrics
get_metrics_score(abc_tuned1)
# Creating confusion matrix
make_confusion_matrix(abc_tuned1, y_test)
Accuracy on training set :  0.9954853273137697
Accuracy on test set :  0.9740046067785456
Recall on training set :  0.9850746268656716
Recall on test set :  0.9262295081967213
Precision on training set :  0.9868073878627969
Precision on test set :  0.9131313131313131
%%time
# Creating pipeline
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"adaboostclassifier__n_estimators": np.arange(10, 110, 10),
"adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"adaboostclassifier__base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
abc_tuned2 = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting parameters in RandomizedSearchCV
abc_tuned2.fit(X_train, y_train)
print(
    "Best parameters are {} with CV score={}:".format(
        abc_tuned2.best_params_, abc_tuned2.best_score_
    )
)
Best parameters are {'adaboostclassifier__n_estimators': 100, 'adaboostclassifier__learning_rate': 1, 'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.8780006182858028:
Wall time: 2min 52s
# Creating new pipeline with best parameters
abc_tuned2 = make_pipeline(
StandardScaler(),
AdaBoostClassifier(
base_estimator=DecisionTreeClassifier(max_depth=2, random_state=1),
n_estimators=100,
learning_rate=1,
random_state=1,
),
)
# Fit the model on training data
abc_tuned2.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('adaboostclassifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
random_state=1),
learning_rate=1, n_estimators=100,
random_state=1))])
# Calculating different metrics
get_metrics_score(abc_tuned2)
# Creating confusion matrix
make_confusion_matrix(abc_tuned2, y_test)
Accuracy on training set :  0.9954853273137697
Accuracy on test set :  0.9740046067785456
Recall on training set :  0.9850746268656716
Recall on test set :  0.9262295081967213
Precision on training set :  0.9868073878627969
Precision on test set :  0.9131313131313131
# defining list of models
models = [xgb_tuned1, xgb_tuned2, abc_tuned1, abc_tuned2, gbc_tuned1, gbc_tuned2 ]
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:
j = get_metrics_score(model, False)
acc_train.append(j[0])
acc_test.append(j[1])
recall_train.append(j[2])
recall_test.append(j[3])
precision_train.append(j[4])
precision_test.append(j[5])
comparison_frame = pd.DataFrame(
{
"Model": [
"XGBoost tuned with GridSearchCV",
"XGBoost tuned with RandomizedSearchCV",
"AdaBoost tuned with GridSearchCV",
"AdaBoost tuned with RandomizedSearchCV",
"Gradient Boosting tuned with GridSearchCV",
"Gradient Boosting tuned with RandomizedSearchCV"
],
"Train_Accuracy": acc_train,
"Test_Accuracy": acc_test,
"Train_Recall": recall_train,
"Test_Recall": recall_test,
"Train_Precision": precision_train,
"Test_Precision": precision_test,
}
)
# Sorting models in decreasing order of test recall
comparison_frame.sort_values(by="Test_Recall", ascending=False)
| Model | Train_Accuracy | Test_Accuracy | Train_Recall | Test_Recall | Train_Precision | Test_Precision | |
|---|---|---|---|---|---|---|---|
| 0 | XGBoost tuned with GridSearchCV | 0.924802 | 0.900954 | 0.983319 | 0.932377 | 0.685435 | 0.629322 |
| 2 | AdaBoost tuned with GridSearchCV | 0.995485 | 0.974005 | 0.985075 | 0.926230 | 0.986807 | 0.913131 |
| 3 | AdaBoost tuned with RandomizedSearchCV | 0.995485 | 0.974005 | 0.985075 | 0.926230 | 0.986807 | 0.913131 |
| 1 | XGBoost tuned with RandomizedSearchCV | 0.913798 | 0.889437 | 0.985075 | 0.922131 | 0.653846 | 0.601604 |
| 4 | Gradient Boosting tuned with GridSearchCV | 0.989419 | 0.974663 | 0.952590 | 0.903689 | 0.981013 | 0.936306 |
| 5 | Gradient Boosting tuned with RandomizedSearchCV | 0.989419 | 0.974663 | 0.952590 | 0.903689 | 0.981013 | 0.936306 |
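A quick way to read a table like this for overfitting is the train-test recall gap; a minimal sketch, using a few rounded values copied from the comparison table above:

```python
import pandas as pd

# Rounded recall values from the comparison table above
cf = pd.DataFrame({
    "Model": ["XGBoost (grid)", "AdaBoost (grid)", "Gradient Boosting (grid)"],
    "Train_Recall": [0.9833, 0.9851, 0.9526],
    "Test_Recall": [0.9324, 0.9262, 0.9037],
})

# A large positive gap suggests the model is fitting noise in the training data
cf["Recall_Gap"] = cf["Train_Recall"] - cf["Test_Recall"]
print(cf.sort_values("Recall_Gap").to_string(index=False))
```

All three gaps here are around 5 percentage points, so none of the tuned models looks severely overfit on recall alone; precision tells a different story for the XGBoost pipelines.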
Here, we will attempt to build upon the best model by using alternative approaches to cleaning and preprocessing the data, to see if we can further improve the model.
In Approach B, we will keep the unknown values as a separate category. When one-hot encoding categorical columns into binary indicators, it is customary to drop one column, because the dropped category is implied whenever all the remaining columns are 0. In this case, we will drop the 'Unknown' columns.
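The drop-one-column idea can be seen on a toy column; the category values below are illustrative, mirroring the Marital_Status levels in the data:

```python
import pandas as pd

s = pd.Series(["Married", "Single", "Unknown", "Married"], name="Marital_Status")

# Build an indicator column per category, then drop 'Unknown':
# a row whose remaining indicators are all 0 is implicitly 'Unknown'
dummies = pd.get_dummies(s).drop(columns=["Unknown"])
print(dummies)
```

Dropping a column this way also avoids perfect multicollinearity among the dummies, which matters most for linear models such as logistic regression.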
Note: Here, we will use dataset TB3, which was only partially cleaned (column names) and put aside earlier for this purpose. We can take a quick look at the shape of the dataset...
TB3.head()
| Flag | Age | Gender | Dependents | Education | Marital_Status | Income | Card | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | Unknown | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
TB3.shape
(10127, 20)
The first step is to convert the dependent variable to binary; after that, we can one-hot encode the categorical variables...
# Map Attrited Customer = 1; Existing Customer = 0, so that churn is the positive class
codes = {'Existing Customer':0, 'Attrited Customer':1}
TB3['Flag'] = TB3['Flag'].map(codes)
# One-hot encode the categorical variables (via pd.get_dummies), dropping one category from each
TB3.Gender = TB3.Gender.replace({'F':1,'M':0})
TB3 = pd.concat([TB3,pd.get_dummies(TB3['Education']).drop(columns=['Unknown'])],axis=1)
TB3 = pd.concat([TB3,pd.get_dummies(TB3['Income']).drop(columns=['Unknown'])],axis=1)
TB3 = pd.concat([TB3,pd.get_dummies(TB3['Marital_Status']).drop(columns=['Unknown'])],axis=1)
TB3 = pd.concat([TB3,pd.get_dummies(TB3['Card']).drop(columns=['Platinum'])],axis=1)
TB3.drop(columns = ['Education','Income','Marital_Status','Card'],inplace=True)
TB3.head()
| Flag | Age | Gender | Dependents | Months | Products_Held | Months_Inactive | Contacts | Credit_Limit | Balance | Ave_Credit_Line | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | College | Doctorate | Graduate | High School | Post-Graduate | Uneducated | $120K + | $40K - $60K | $60K - $80K | $80K - $120K | Less than $40K | Divorced | Married | Single | Blue | Gold | Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 45 | 0 | 3 | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 49 | 1 | 5 | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 0 | 51 | 0 | 3 | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 40 | 1 | 4 | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 40 | 0 | 3 | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
X = TB3.drop('Flag',errors='ignore',axis=1)
y = TB3['Flag']
# Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(7088, 32) (3039, 32)
# Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Age 0 Gender 0 Dependents 0 Months 0 Products_Held 0 Months_Inactive 0 Contacts 0 Credit_Limit 0 Balance 0 Ave_Credit_Line 0 Trans_Changes 0 Trans_Totals 0 Trans_Count 0 Count_Changes 0 Ratio 0 College 0 Doctorate 0 Graduate 0 High School 0 Post-Graduate 0 Uneducated 0 $120K + 0 $40K - $60K 0 $60K - $80K 0 $80K - $120K 0 Less than $40K 0 Divorced 0 Married 0 Single 0 Blue 0 Gold 0 Silver 0 dtype: int64 ------------------------------ Age 0 Gender 0 Dependents 0 Months 0 Products_Held 0 Months_Inactive 0 Contacts 0 Credit_Limit 0 Balance 0 Ave_Credit_Line 0 Trans_Changes 0 Trans_Totals 0 Trans_Count 0 Count_Changes 0 Ratio 0 College 0 Doctorate 0 Graduate 0 High School 0 Post-Graduate 0 Uneducated 0 $120K + 0 $40K - $60K 0 $60K - $80K 0 $80K - $120K 0 Less than $40K 0 Divorced 0 Married 0 Single 0 Blue 0 Gold 0 Silver 0 dtype: int64
models = [] # Empty list to store all the models
# Appending pipelines for each model into the list
models.append(
(
"Logistic Regression",
Pipeline(
steps=[
("scaler", StandardScaler()),
("log_reg", LogisticRegression(random_state=1)),
]
),
)
)
models.append(
(
"Random Forest",
Pipeline(
steps=[
("scaler", StandardScaler()),
("random_forest", RandomForestClassifier(random_state=1)),
]
),
)
)
models.append(
(
"Gradient Boosting",
Pipeline(
steps=[
("scaler", StandardScaler()),
("gradient_boosting", GradientBoostingClassifier(random_state=1)),
]
),
)
)
models.append(
(
"AdaBoost",
Pipeline(
steps=[
("scaler", StandardScaler()),
("adaboost", AdaBoostClassifier(random_state=1)),
]
),
)
)
models.append(
(
"XGBoost",
Pipeline(
steps=[
("scaler", StandardScaler()),
("xgboost", XGBClassifier(random_state=1, eval_metric='logloss')),
]
),
)
)
models.append(
(
"Decision Tree",
Pipeline(
steps=[
("scaler", StandardScaler()),
("decision_tree", DecisionTreeClassifier(random_state=1)),
]
),
)
)
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# loop through all models to get the mean cross validated score
for name, model in models:
scoring = "recall"
kfold = StratifiedKFold(
n_splits=5, shuffle=True, random_state=1
) # Setting number of splits equal to 5
cv_result = cross_val_score(
estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
)
results.append(cv_result)
names.append(name)
print("{}: {}".format(name, cv_result.mean() * 100))
Logistic Regression: 58.64981837854548
Random Forest: 76.38225519746503
Gradient Boosting: 84.10773630110519
AdaBoost: 83.84457840636837
XGBoost: 87.35605533657933
Decision Tree: 78.39941262848751
# create a table for comparison purposes
Table3 = pd.Series([58.65,76.38,84.11,83.84,87.36,78.40],
                   ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'AdaBoost', 'XGBoost', 'Decision Tree'])
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
# create a table for comparison purposes
Table1 = pd.Series([57.95,79.54,82.53,82.97,85.34,77.87],
                   ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'AdaBoost', 'XGBoost', 'Decision Tree'])
# create a table for comparison purposes
Table2 = pd.Series([58.65,80.33,84.02,83.67,86.92,79.63],
                   ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'AdaBoost', 'XGBoost', 'Decision Tree'])
CT = pd.concat([Table1,Table2,Table3],axis=1,sort=False)
CT.columns=['Manual Impute','Knn Imputer', 'OneHotEncoder']
print('\033[1m' + 'Cross Validation Scores (model averages)')
CT
Cross Validation Scores (model averages)
| Manual Impute | Knn Imputer | OneHotEncoder | |
|---|---|---|---|
| Logistic Regression | 57.95 | 58.65 | 58.65 |
| Random Forest | 79.54 | 80.33 | 76.38 |
| Gradient Boosting | 82.53 | 84.02 | 84.11 |
| AdaBoost | 82.97 | 83.67 | 83.84 |
| XGBoost | 85.34 | 86.92 | 87.36 |
| Decision Tree | 77.87 | 79.63 | 78.40 |
We can see that, for the three best models, the OneHotEncoder approach gives even higher cross-validation scores.
We will use pipelines with StandardScaler to tune the model using GridSearchCV and RandomizedSearchCV. In this instance, we will tune only the top model, XGBoost, since we know it outperforms the other models.
First, let's create two helper functions, one to calculate the different metric scores and one to plot the confusion matrix, so that we don't have to repeat the same code for each model.
## Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, flag=True):
    """
    model : classifier to predict values of X
    """
    # defining an empty list to store train and test results
    score_list = []
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)
    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)
    score_list.extend(
        (
            train_acc,
            test_acc,
            train_recall,
            test_recall,
            train_precision,
            test_precision,
        )
    )
    # The following print statements are displayed only when flag is True (the default)
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)
    return score_list  # returning the list with train and test scores
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual, labels=[0, 1]):
    """
    model : classifier to predict values of X
    y_actual : ground truth
    """
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    data_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot_labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot_labels = np.asarray(annot_labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(data_cm, annot=annot_labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
XGBoost was our top-performing model prior to hyperparameter tuning. The cross-validation score on the training data after using the KNN imputer was 86.82.
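As a rough illustration of how such a cross-validation score can be produced, the sketch below runs `cross_val_score` over a pipeline that chains a KNN imputer with a boosted-tree classifier. Everything here is a stand-in: the data is synthetic, `GradientBoostingClassifier` replaces `XGBClassifier` so the snippet has no xgboost dependency, and the resulting score will not match the project's 86.82.

```python
# Hedged sketch (not the project's actual code): cross-validating a pipeline
# that imputes missing values with KNN before fitting a boosted-tree model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data standing in for the bank's features
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X[::20, 0] = np.nan  # inject some missing values for the imputer to fill

pipe = make_pipeline(
    KNNImputer(n_neighbors=5),
    GradientBoostingClassifier(random_state=1),
)
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(n_splits=5), scoring="recall")
print("Mean CV recall: {:.2f}".format(scores.mean() * 100))
```

Because the imputer lives inside the pipeline, each CV fold imputes using only that fold's training portion, avoiding leakage from the validation rows.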
%%time
# Creating pipeline
pipe = make_pipeline(
StandardScaler(), XGBClassifier(random_state=1, eval_metric="logloss")
)
# Parameter grid to pass in GridSearchCV
param_grid = {
"xgbclassifier__n_estimators": np.arange(50, 300, 50),
"xgbclassifier__scale_pos_weight": [0, 1, 2, 5, 10],
"xgbclassifier__learning_rate": [0.01, 0.1, 0.2, 0.05],
"xgbclassifier__gamma": [0, 1, 3, 5],
"xgbclassifier__subsample": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)
# Fitting parameters in GridSeachCV
grid_cv.fit(X_train, y_train)
print(
"Best parameters are {} with CV score={}:".format(
grid_cv.best_params_, grid_cv.best_score_
)
)
Best parameters are {'xgbclassifier__gamma': 3, 'xgbclassifier__learning_rate': 0.1, 'xgbclassifier__n_estimators': 50, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__subsample': 0.7} with CV score=0.9499574928510703:
Wall time: 7min 12s
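For context on the wall time, a quick back-of-the-envelope count (illustrative bookkeeping only, not part of the original notebook) shows how many fits an exhaustive grid of the shape above implies, and why the RandomizedSearchCV run below, capped at `n_iter=50`, is cheaper:

```python
# Illustrative: GridSearchCV fits every parameter combination once per CV fold,
# while RandomizedSearchCV samples a fixed number of candidates.
param_grid_sizes = {
    "n_estimators": 5,       # np.arange(50, 300, 50)
    "scale_pos_weight": 5,   # [0, 1, 2, 5, 10]
    "learning_rate": 4,      # [0.01, 0.1, 0.2, 0.05]
    "gamma": 4,              # [0, 1, 3, 5]
    "subsample": 4,          # [0.7, 0.8, 0.9, 1]
}
n_candidates = 1
for size in param_grid_sizes.values():
    n_candidates *= size

grid_fits = n_candidates * 5   # cv=5 folds per candidate
randomized_fits = 50 * 5       # n_iter=50, cv=5
print(n_candidates, grid_fits, randomized_fits)  # 1600 8000 250
```

So an exhaustive sweep of this grid means thousands of fits, whereas the randomized search trades exhaustiveness for a fixed, much smaller budget.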
# Creating new pipeline with best parameters
xgb_tuned1 = make_pipeline(
StandardScaler(),
XGBClassifier(
random_state=1,
n_estimators=50,
scale_pos_weight=10,
subsample=0.7,
learning_rate=0.01,
gamma=3,
),
)
# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
[00:29:46] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Pipeline(steps=[('standardscaler', StandardScaler()),
('xgbclassifier',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, gamma=3, gpu_id=-1,
importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=50,
n_jobs=4, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=10,
subsample=0.7, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
get_metrics_score(xgb_tuned1)
# Creating confusion matrix
make_confusion_matrix(xgb_tuned1, y_test)
Accuracy on training set :  0.9250846501128668
Accuracy on test set :  0.9032576505429417
Recall on training set :  0.9841966637401229
Recall on test set :  0.9364754098360656
Precision on training set :  0.686046511627907
Precision on test set :  0.6347222222222222
%%time
# Creating pipeline
pipe = make_pipeline(
    StandardScaler(), XGBClassifier(random_state=1, eval_metric="logloss")
)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "xgbclassifier__n_estimators": np.arange(50, 300, 50),
    "xgbclassifier__scale_pos_weight": [0, 1, 2, 5, 10],
    "xgbclassifier__learning_rate": [0.01, 0.1, 0.2, 0.05],
    "xgbclassifier__gamma": [0, 1, 3, 5],
    "xgbclassifier__subsample": [0.7, 0.8, 0.9, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
)
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print(
    "Best parameters are {} with CV score={}:".format(
        randomized_cv.best_params_, randomized_cv.best_score_
    )
)
Best parameters are {'xgbclassifier__subsample': 0.9, 'xgbclassifier__scale_pos_weight': 10, 'xgbclassifier__n_estimators': 200, 'xgbclassifier__learning_rate': 0.05, 'xgbclassifier__gamma': 5} with CV score=0.9482069711724245:
Wall time: 3min 2s
# Creating new pipeline with best parameters
xgb_tuned2 = Pipeline(
steps=[
("scaler", StandardScaler()),
(
"XGB",
XGBClassifier(
random_state=1,
n_estimators=20,
scale_pos_weight=10,
learning_rate=0.01,
gamma=1,
subsample=0.9,
),
),
]
)
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
[00:32:49] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Pipeline(steps=[('scaler', StandardScaler()),
('XGB',
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, gamma=1, gpu_id=-1,
importance_type='gain',
interaction_constraints='', learning_rate=0.01,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=20,
n_jobs=4, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=10,
subsample=0.9, tree_method='exact',
validate_parameters=1, verbosity=None))])
# Calculating different metrics
get_metrics_score(xgb_tuned2)
# Creating confusion matrix
make_confusion_matrix(xgb_tuned2, y_test)
Accuracy on training set :  0.9122460496613995
Accuracy on test set :  0.8897663705166173
Recall on training set :  0.9850746268656716
Recall on test set :  0.930327868852459
Precision on training set :  0.6496815286624203
Precision on test set :  0.6013245033112583
Note: Up to this point in the project, I have done everything to the best of my ability to fulfill the assignment criteria. Below, I improve the model further by plotting a Receiver Operating Characteristic (ROC) curve for the best model and finding the optimal classification threshold. Applying that threshold to the test-data predictions improves the accuracy and recall scores and further reduces false negatives. Although outside the scope of the assignment, this application of a previously learnt technique to the best model found above was used to complete the adjacent report, and was done for my own personal learning benefit.
# Creates probability vectors from our best model on the test and train data which can then be used to plot a ROC Curve
prob_vector = xgb_tuned1.predict_proba(X_test)[:, 1]
prob_vector2 = xgb_tuned1.predict_proba(X_train)[:, 1]
# This code will plot a ROC Curve for the test data
y_probas = prob_vector
from sklearn.metrics import roc_curve, auc
fpr, tpr, _ = roc_curve(y_test, y_probas)
roc_auc = auc(fpr, tpr)
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
lw=lw, label='ROC curve (area = %0.4f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for test data')
plt.legend(loc="lower right")
plt.show()
# We can further explore the plotted ROC curve and find best threshold to binarize the predictions from our best model
pred_y = prob_vector
# calculate roc curves
fpr, tpr, thresholds = roc_curve(y_test, pred_y)
# calculate the g-mean for each threshold
gmeans = np.sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
# plot the roc curve for the model
plt.figure(num=0, figsize=[6.4, 4.8])
plt.plot([0,1], [0,1], linestyle='--', label='No Skill')
plt.plot(fpr, tpr, marker='.', label='XGBoost')
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
# show the plot
plt.show()
Best Threshold=0.178499, G-Mean=0.961
The optimal model threshold is predicted to be about 0.18. Let's check the model scores while using the optimal threshold.
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

optimal_threshold = 0.178499
# Model prediction with optimal threshold
pred_train_opt = (prob_vector2 > optimal_threshold).astype(int)
pred_test_opt = (prob_vector > optimal_threshold).astype(int)
print('Accuracy on train data:', accuracy_score(y_train, pred_train_opt))
print('Accuracy on test data:', accuracy_score(y_test, pred_test_opt))
print('')
print('Recall on train data:', recall_score(y_train, pred_train_opt))
print('Recall on test data:', recall_score(y_test, pred_test_opt))
print('')
print('Precision on train data:', precision_score(y_train, pred_train_opt))
print('Precision on test data:', precision_score(y_test, pred_test_opt))
print('')
print('f1 score on train data:', f1_score(y_train, pred_train_opt))
print('f1 score on test data:', f1_score(y_test, pred_test_opt))
Accuracy on train data: 0.9696670428893905
Accuracy on test data: 0.9549193813754524

Recall on train data: 0.9877085162423178
Recall on test data: 0.9672131147540983

Precision on train data: 0.8484162895927602
Precision on test data: 0.7959527824620574

f1 score on train data: 0.9127789046653144
f1 score on test data: 0.8732654949121185
def make_confusion_matrix_simple(y_actual, y_predict):
    """
    y_predict : prediction of class
    y_actual  : ground truth
    """
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(
        cm,
        index=["Actual - No", "Actual - Yes"],
        columns=["Predicted - No", "Predicted - Yes"],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

make_confusion_matrix_simple(y_test, pred_test_opt)
From our best model above, we see that there are 16 false negatives. These are misclassified predictions: the accounts of these customers have actually gone into attrition, but the model predicted they would remain open. This is the biggest cost facing Thera Bank. We can examine the individual misclassified predictions using the technique below. Note: there are also 121 misclassified false positives, but since these are far less costly to the bank, analysis of the false positives will be skipped.
False Negatives - Test Data
TB_predict1 = pred_test_opt #Predictions on the best model Test data
TB5 = TB[TB.index.isin(X_test.index.values)].copy() # creates a dataframe, TB5, containing just the best model X_test data
X_test.head() # We know that the test data was randomly selected and randomly ordered, so let's look at the head
| | Gender | Education | Income | Card | Products_Held | Months_Inactive | Contacts | Balance | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | Married | Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7403 | 0 | 3.0 | 2 | 0 | 5 | 2 | 1 | 1521 | 0.692 | 4666 | 69 | 0.865 | 0.399 | 1 | 0 |
| 2005 | 0 | 0.0 | 4 | 0 | 2 | 3 | 4 | 0 | 0.315 | 809 | 15 | 0.250 | 0.000 | 1 | 0 |
| 8270 | 0 | 5.0 | 3 | 0 | 2 | 3 | 2 | 1162 | 0.539 | 4598 | 86 | 0.623 | 0.808 | 1 | 0 |
| 646 | 0 | 3.0 | 3 | 0 | 4 | 3 | 2 | 1811 | 0.754 | 1465 | 31 | 0.476 | 0.153 | 0 | 1 |
| 1690 | 1 | 3.0 | 1 | 0 | 4 | 2 | 4 | 637 | 0.622 | 2608 | 78 | 0.592 | 0.139 | 0 | 1 |
TB5.loc[[7403, 2005, 8270, 646, 1690], :] # TB5 was not randomly ordered, so let's check whether the data matches by calling the rows above
| | Flag | Gender | Education | Income | Card | Products_Held | Months_Inactive | Contacts | Balance | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | Married | Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7403 | 0 | 0 | 3.0 | 2 | 0 | 5 | 2 | 1 | 1521 | 0.692 | 4666 | 69 | 0.865 | 0.399 | 1 | 0 |
| 2005 | 1 | 0 | 0.0 | 4 | 0 | 2 | 3 | 4 | 0 | 0.315 | 809 | 15 | 0.250 | 0.000 | 1 | 0 |
| 8270 | 0 | 0 | 5.0 | 3 | 0 | 2 | 3 | 2 | 1162 | 0.539 | 4598 | 86 | 0.623 | 0.808 | 1 | 0 |
| 646 | 0 | 0 | 3.0 | 3 | 0 | 4 | 3 | 2 | 1811 | 0.754 | 1465 | 31 | 0.476 | 0.153 | 0 | 1 |
| 1690 | 0 | 1 | 3.0 | 1 | 0 | 4 | 2 | 4 | 637 | 0.622 | 2608 | 78 | 0.592 | 0.139 | 0 | 1 |
The data seems to match. We now face the difficulty that the TB_predict1 values are also unordered. We can solve this by attaching TB_predict1 to the X_test data as a new column, sorting by index, and then dropping all columns except the Predicted column. This is a bit of a workaround because TB_predict1 is a NumPy array rather than a DataFrame: it has no index of its own and cannot be sorted independently.
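As an aside, a simpler route is possible: wrap the prediction array in a pandas Series that carries X_test's index, and pandas will align on that index automatically when the column is assigned, with no sorting needed. The sketch below demonstrates this on small stand-in frames (the names `X_test_demo`, `preds`, and `target_frame` are illustrative, not the project's variables):

```python
# Hedged sketch: index-based alignment instead of the sort-and-drop workaround.
import numpy as np
import pandas as pd

# Stand-in for X_test: note the non-sequential, unsorted index
X_test_demo = pd.DataFrame({"feature": [10, 20, 30]}, index=[7, 2, 5])
# Stand-in for TB_predict1: a bare array in the same row order as X_test_demo
preds = np.array([1, 0, 1])

# Wrapping the array in a Series gives it X_test_demo's index
pred_series = pd.Series(preds, index=X_test_demo.index, name="Predicted")

# Stand-in for TB5: same labels, different row order
target_frame = pd.DataFrame({"Flag": [0, 1, 0]}, index=[2, 5, 7])
target_frame["Predicted"] = pred_series  # pandas aligns on the index, not position
print(target_frame)
```

Assignment of a Series to a DataFrame column always matches rows by index label, so the ordering of either side is irrelevant.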
X_test2 = X_test.copy() # make a copy first, so that if it doesn't work we will not have to re-run all the code
X_test2["Predicted"] = TB_predict1 # this attaches the test prediction values onto the X_test2 data as a new column, 'Predicted'
X_test2 = X_test2.sort_index() # we can now sort the entire X_test2 frame by index, with the new 'Predicted' column attached
X_test2 = X_test2[['Predicted']] # this drops all the columns in the dataset except 'Predicted'
X_test2.head() # check to see if it sorted correctly
| | Predicted |
|---|---|
| 0 | 0 |
| 2 | 0 |
| 8 | 0 |
| 14 | 0 |
| 15 | 0 |
TB5.head(5) # Does the index match the dataset we want to attach this column to? Let's see...
| | Flag | Gender | Education | Income | Card | Products_Held | Months_Inactive | Contacts | Balance | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | Married | Single |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1.0 | 2 | 0 | 5 | 1 | 3 | 777 | 1.335 | 1144 | 42 | 1.625 | 0.061 | 1 | 0 |
| 2 | 0 | 0 | 3.0 | 3 | 0 | 4 | 1 | 0 | 0 | 2.594 | 1887 | 20 | 2.333 | 0.000 | 1 | 0 |
| 8 | 0 | 0 | 0.0 | 2 | 0 | 5 | 2 | 0 | 2517 | 3.355 | 1350 | 24 | 1.182 | 0.113 | 0 | 1 |
| 14 | 0 | 1 | 3.0 | 0 | 0 | 5 | 2 | 2 | 680 | 1.190 | 1570 | 29 | 0.611 | 0.279 | 1 | 0 |
| 15 | 0 | 0 | 3.0 | 3 | 0 | 5 | 1 | 2 | 972 | 1.707 | 1348 | 27 | 1.700 | 0.230 | 0 | 0 |
TB5["Predicted"] = X_test2 # the indexes match, so we can attach the single column X_test2 dataframe to our full test data
TB5['Flag'].value_counts() # counts the actual target values from the dataset rows that were selected into the test set
0    2551
1     488
Name: Flag, dtype: int64
TB5['Predicted'].value_counts() # counts the values predicted by the model on the test set
0    2446
1     593
Name: Predicted, dtype: int64
We know that false negatives, the predictions of interest, occur when the data shows the customer account is ACTUALLY closed but the model predicts that it is open. This information is stored in the dataset wherever Flag = 1 and Predicted = 0. We know from the confusion matrix above that this should occur 16 times. Let's create a workable dataset of just the false negatives, called FN1...
FN1 = TB5[(TB5['Flag'] == 1)&(TB5['Predicted'] == 0)] #Creates a dataframe of just the false negatives
FN1.shape #This code allows us to check the shape of the new dataset. There should be just 16 rows...
(16, 17)
FN1.head(16)
| | Flag | Gender | Education | Income | Card | Products_Held | Months_Inactive | Contacts | Balance | Trans_Changes | Trans_Totals | Trans_Count | Count_Changes | Ratio | Married | Single | Predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1257 | 1 | 1 | 3.0 | 0 | 0 | 5 | 1 | 2 | 951 | 0.737 | 1096 | 27 | 0.588 | 0.174 | 0 | 1 | 0 |
| 1950 | 1 | 0 | 0.0 | 2 | 0 | 6 | 2 | 3 | 0 | 0.733 | 967 | 24 | 1.182 | 0.000 | 0 | 0 | 0 |
| 2109 | 1 | 0 | 3.0 | 4 | 0 | 3 | 3 | 0 | 0 | 0.995 | 1564 | 31 | 0.476 | 0.000 | 0 | 1 | 0 |
| 2411 | 1 | 0 | 0.0 | 4 | 0 | 5 | 2 | 4 | 787 | 0.763 | 1312 | 32 | 0.600 | 0.023 | 1 | 0 | 0 |
| 3810 | 1 | 0 | 1.0 | 2 | 0 | 6 | 3 | 3 | 0 | 0.432 | 1329 | 35 | 0.591 | 0.000 | 1 | 0 | 0 |
| 5660 | 1 | 0 | 3.0 | 3 | 0 | 3 | 3 | 3 | 321 | 0.760 | 2078 | 62 | 0.550 | 0.025 | 1 | 0 | 0 |
| 6125 | 1 | 0 | 0.0 | 3 | 2 | 6 | 3 | 2 | 1104 | 0.591 | 1989 | 47 | 0.741 | 0.032 | 0 | 1 | 0 |
| 6246 | 1 | 1 | 3.0 | 0 | 0 | 6 | 1 | 3 | 1647 | 0.513 | 1903 | 44 | 0.630 | 0.898 | 0 | 1 | 0 |
| 6284 | 1 | 1 | 4.0 | 0 | 0 | 5 | 2 | 3 | 0 | 0.865 | 2716 | 63 | 0.703 | 0.000 | 1 | 0 | 0 |
| 6434 | 1 | 0 | 0.0 | 3 | 0 | 4 | 2 | 3 | 0 | 0.500 | 2625 | 63 | 0.500 | 0.000 | 0 | 1 | 0 |
| 6667 | 1 | 0 | 3.0 | 1 | 0 | 5 | 3 | 2 | 2517 | 0.690 | 2516 | 64 | 0.641 | 0.435 | 1 | 0 | 0 |
| 7594 | 1 | 1 | 1.0 | 0 | 0 | 3 | 2 | 4 | 0 | 0.838 | 3121 | 56 | 0.867 | 0.000 | 0 | 1 | 0 |
| 7782 | 1 | 1 | 0.0 | 1 | 0 | 2 | 3 | 2 | 965 | 0.928 | 3300 | 58 | 0.568 | 0.306 | 0 | 0 | 0 |
| 8967 | 1 | 0 | 2.0 | 3 | 0 | 4 | 3 | 3 | 0 | 0.890 | 4520 | 63 | 0.370 | 0.000 | 0 | 0 | 0 |
| 9014 | 1 | 1 | 2.0 | 1 | 0 | 2 | 3 | 1 | 0 | 1.005 | 5242 | 74 | 0.574 | 0.000 | 0 | 1 | 0 |
| 9142 | 1 | 1 | 1.0 | 1 | 0 | 4 | 3 | 3 | 0 | 0.494 | 4399 | 54 | 0.862 | 0.000 | 0 | 1 | 0 |